I. RELATED APPLICATIONS

This application relates to provisional application Serial No. 60/148,652 filed on 
August 13, 1999, drawing priority therefrom. 



II. BACKGROUND OF THE INVENTION 

The present invention relates to the field of digital signal processor (DSP) architectures, 
and more particularly to DSP architectures with optimized architectures. 

With the increasing commercial importance of DSP-intensive applications, such as 
wireless communication, modems, and computer telephony, has come an increasing recognition 
of the benefit of implementing DSP functions on a CPU. Not only are CPUs usually needed for 
memory management, user interface and Internet Protocol software, CPUs also have excellent 
third-party software tool support. 

Implementing certain DSP algorithms, such as the FIR Filter or Discrete Cosine 
Transform (DCT), in software, however, may degrade performance up to an order of magnitude 
as compared to specialized DSPs. Another difficulty is the problem of deterministic real-time 
allocation in sophisticated CPUs. 

Some vendors have tried to address these problems by offering auxiliary processing 
components. DSP coprocessors, for example, use separate instruction sets, instruction stores and 
execution units, and DSP accelerators share the same I-stream with the CPU but have separate 
execution units. These approaches, however, impose a substantial burden on the CPU in 
managing DSP functions. 



III. SUMMARY OF THE INVENTION 

Systems and methods consistent with the present invention provide for an alternate DSP 
architecture configuration that allows for more efficient implementation of DSP algorithms and 
use of CPU resources. 

A digital signal processor consistent with this invention comprises two execution 
pipelines capable of executing RISC instructions; instruction fetch logic that simultaneously 
fetches two instructions and routes them to respective pipelines; and control logic to allow the 
pipelines to operate independently. 

Another digital signal processor consistent with this invention and capable of integrating 
subopcodes into an established CPU instruction set comprises a memory that stores instructions 
having opcodes; an instruction decoder that identifies a relocatable opcode to designate 
subopcodes; and a subopcode detector that decodes subopcodes if the instruction decoder 
identifies the relocatable opcode. 

Still another digital signal processor consistent with this invention comprises a register 
pair; and means for executing a multiply instruction on a number stored in the register pair, 
including first means for performing multiply instructions on higher-order portions of each 
register in the register pair, second means for performing multiply instructions on the remaining 
portions of each register in the register pair, and third means for combining the results from the 
first and second means. 

A circular buffer control circuit consistent with this invention comprises a first number of 
circular buffer start registers; a first number of circular buffer end registers, each associated with 
a different one of the circular buffer start registers; and circular buffer control logic including 
means for comparing a pointer to an address in a selected one of the circular buffer end 
registers, and means for restoring the address in the one of the circular buffer start registers 
associated with the selected circular buffer end register if the pointer matches the address in the 
selected circular buffer end register. 
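By way of illustration only, the compare-and-restore behavior recited above can be modeled in software. The following is a minimal sketch, not the circuit itself; the function name `next_pointer`, the list-based register banks, and the use of a stride for the non-wrapping case are assumptions made for the example.

```python
def next_pointer(pointer, stride, start_regs, end_regs, selected):
    """Return the pointer value to use after the current access.

    start_regs / end_regs model the paired circular buffer start and end
    registers; `selected` picks which pair is in use (hypothetical encoding).
    """
    if pointer == end_regs[selected]:
        # Pointer matches the end address: restore the associated start address.
        return start_regs[selected]
    # Otherwise advance normally (assumed post-increment by `stride`).
    return pointer + stride


# Walk a four-entry buffer of 32-bit words twice around.
starts, ends = [0x100], [0x10C]
p, visited = 0x100, []
for _ in range(6):
    visited.append(p)
    p = next_pointer(p, 4, starts, ends, 0)
```

In this model the wrap costs no extra instruction, which is the point of placing the comparison in hardware.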

IV. BRIEF DESCRIPTION OF DRAWINGS 

The accompanying drawings, which are incorporated in and constitute a part of this 
specification, illustrate one embodiment of the invention and, together with the description, serve 
to explain the objects, advantages, and principles of the invention. In the drawings: 

Fig. 1 is a block diagram of a DSP architecture consistent with this invention; 

Fig. 2 is a data flow diagram of the superscalar instruction issue consistent with this 
invention; 

Fig. 3 is a table indicating the instruction select logic consistent with this invention; 
Fig. 4 shows a superscalar RALU datapath consistent with this invention; 
Fig. 5 shows an example of a MMD register consistent with this invention; 
Fig. 6 is an illustration of a dual MAC datapath consistent with this invention; 
Figs. 7A, 7B, and 7C show some of the data arithmetic modes; 
Figs. 8A, 8B, 8C, and 8D contain a table of the instructions supported by an 
implementation consistent with this invention; 

Fig. 9 is a table showing the assignment of instructions to different pipelines; 
Fig. 10 is a block diagram of circular buffers consistent with this invention; 
Figs. 11A, 11B, and 11C show a table summarizing vector addressing instructions; 
Figs. 12A and 12B show a table with instructions when a saturation option is provided; 
Figs. 13A and 13B show a table with additional ALU operations; 

Fig. 14 contains a table with conditional operations; 

Fig. 15 contains a table with the cycles required between instructions; 

Fig. 16 is a block diagram of several coprocessor registers; and 

Figs. 17A-H illustrate additional LEXOP codes. 

V. DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Reference will now be made in detail to embodiments consistent with this invention that 
are illustrated in the accompanying drawings. The same reference numbers in different drawings 
generally refer to the same or like parts. 

A. OVERVIEW 

Performance critical DSP algorithms such as FIR filters need to perform 
multiply-accumulate on memory-based operands. Typically, DSP architectures have a CISC 
instruction set with memory-based operands. RISCs on the other hand have load/store 
architectures and register operands. Extending the RISC into DSP with superscalar issue allows 
one instruction to load the register from memory and another to execute on register-based 
operands that were loaded earlier. 

A RISC-DSP consistent with the present invention can achieve a system clock speed of 
200 MHz and a peak computational power of 400 million multiply-accumulate (MAC) 
operations per second when implemented in 0.18 µm technology, which is comparable to the 
highest-performance DSPs available. Also, tightly integrating DSP extensions into an existing 
instruction set, such as the MIPS ISA, gives access to a wide variety of third-party tools and 
allows programmers to switch seamlessly from RISC code to DSP code. The extensions include 
(1) multiply-accumulate (MAC) instructions, including guard bits (with support for fractional 
arithmetic mode), and saturation and rounding for high-fidelity DSP arithmetic operations that 
have not previously been available in RISC processors, (2) dual 16-bit versions of all ALU 
operations, (3) post-modified memory pointers with hardware circular buffer support, (4) a 
zero-overhead loop counter, which can be interrupted and nested, (5) conditional move, (6) 
specialized ALU operations, (7) prioritized low-overhead interrupts, and (8) load/store of two 
32-bit general registers with a single instruction. 

These ISA extensions are accomplished by introducing an I-Format opcode called 
LEXOP, which creates 64 additional subopcodes encoded in INST[5:0]. 

Fig. 1 shows an example of a DSP 100 architecture consistent with this invention. DSP 
100 includes a Coprocessor 0 (CP0) 110, a Register File Arithmetic Logic Unit (RALU) 120, 
instruction memory and issue logic 130, data memory 140, and LBC 150. IADDR (instruction 
address) bus 160 links coprocessor 0 110, instruction memory and issue logic 130, and LBC 
150. DADDR (data address) bus 170 links RALU 120, data memory 140, and LBC 150. 
Instruction pathways INSTA 163 and INSTB 166 connect between CP0 110, RALU 120, 
instruction memory and issue logic 130, and LBC 150. DBUS (data bus) 175 connects between 
CP0 110, RALU 120, data memory 140, and LBC 150. 

CI (customer interface) 180 acts as an interface to a customer coprocessor and connects to 
INSTA pathway 163 and DBUS 175. LBC 150 also interfaces to DI (data in) bus 192, DO (data 
out) bus 194, address bus 196, and control bus 198. 

One implementation of the DSP consistent with this invention adds a stage to a 
dual-issue, five-stage pipeline to form dual-issue six-stage pipelines. The additional stage is a D 
(Decode) stage. This superpipeline design can achieve higher performance and isolate the 
processor's logic from customer-supplied instruction memories. Sustained performance near the 
peak 400 million MACs per second can be realized for inner loops of key DSP algorithms. 

One pipeline is preferably a load/store pipe. It includes data memory access and all 
instructions except multiply and divide operations. Another pipeline is preferably a multiply- 
accumulate pipe. It includes a MAC and ALU, each with dual 16-bit operations. 

Preferably, DSP algorithms will use the load/store pipe to load a pair of operands into a 
general register while executing dual MAC operations in the multiply-accumulate pipe on earlier 
data. Decoupling register loads from the MAC processing allows loop unrolling and takes 
effective advantage of the thirty-two general registers for temporary storage. In addition, dual- 
issue allows the DSP to achieve the memory bandwidth required by DSP within a RISC 
architecture, and dual 16-bit SIMD operations take full advantage of the 32-bit RISC datapaths. 

B. Description of Key Components and Features 
1. Superscalar Architecture 

As explained above, a DSP consistent with this invention preferably issues dual 32-bit 
instructions to two distinct six-stage execution pipelines. Fig. 2 shows data paths for the 
instruction issue. An instruction cache and IRAM 210 in instruction memory and issue logic 130 
(Fig. 1) can fetch two 32-bit instructions simultaneously. Following the superscalar instruction 
buffer and issue logic described below, the instructions issue as IB_S_R to one pipeline, called 
Pipe B, and IA_S_R to the other pipeline, called Pipe A. Pipe A would be the "load/store pipe" 
described above because it uniquely supports the load and store operations. Pipe B, then, would 
be the multiply-accumulate pipe. 

The dual-issue design offers significant performance advantages. For example, peak 
computational performance for DSP algorithms typically requires at least one memory operand 
per instruction cycle. The dual-issue superscalar design, on the other hand, allows a memory 
operand to be loaded by one instruction while the MAC operates on earlier register-based data. 

The RISC data path is 32 bits, but few DSP algorithms require more than 16 bits of 
precision. This allows the simultaneous fetching of two values from memory, which further 
improves performance. 

The six stages of the execution pipelines are as follows: 
Stage 1   I   Instruction fetch 
Stage 2   D   Decode 
Stage 3   S   Source fetch 
Stage 4   E   Execution 
Stage 5   M   Memory data select 
Stage 6   W   Write back to register file 

To avoid degrading performance, the superscalar issue logic can operate during the 
Decode-Stage of the pipeline that occurs after I (Instruction Fetch) and before S (Source Fetch). 
The D stage allows a better system clock cycle time over that of a five-stage pipeline. Support 
for fully synchronous instruction memories also isolates the processor logic from the customer- 
supplied memories, which facilitates integration. 

Although the D-Stage incurs a two-cycle penalty on branch prediction failure as 
compared to the one-cycle penalty from a five-stage pipeline, the effect of this penalty does not 
usually manifest. For example, because branches are predicted taken in all cases, no wasted 
cycles are incurred for backward branches used as loop counters except in the final cycle where 
the loop "falls through." 

A two-bit Valid register 220 (V0 and V1) is updated with each fetch to indicate whether 
instructions I0 230 and I1 235 are valid. The Valid register can indicate three states: (1) "neither 
valid," which occurs following a cache miss; (2) "both valid," which occurs following a cache 
hit; or (3) "V0=invalid/V1=valid," which occurs following a branch to I1. Later, one or both 
instructions may issue to Pipes A and B. If only one instruction from a valid pair (e.g., I0) issues, 
Valid register 220 will be appropriately updated and instruction fetch will be stalled while the 
second instruction issues. V0=invalid/V1=valid occurs following a branch to I1. V0=valid/ 
V1=invalid will not occur because preference is given to I0 (the first instruction in program order) 
if only one instruction can issue. The next instruction I1 issues before fetching a new pair of 
valid instructions, I0/I1. 

Decoding register 220 belongs to instruction analysis logic 240, which, along with the 
Instruction Select logic, is implemented in the D-Stage. Instruction analysis logic 240 generates 
five key signals: 0eA, 1eA, 0eB, 1eB, and 1d0. 0eA indicates whether I0 can execute in Pipe A. This 
decision is based on the I0 opcode and the resources available in Pipe A. For example, if I0 is an 
ADD and Pipe A can execute an ADD, then 0eA = 1; if I0 is a MAC and Pipe A has no MAC, 
then 0eA = 0. The complexity of this logic can be minimized by careful encoding of operations and by 
careful partitioning of operations between the two Pipes. One simplification could be to partition 
the DSP so that the Pipe B instruction, IB_S_R, is routed only to RALU 120 (Fig. 1). 



A special signal 1d0 indicates that both I0 and I1 are valid and that I1 (the higher logical 
order) depends on I0. This will occur if the result of I0 is used as a source by I1. In this case, only 
I0 will issue. Analysis of the other signals is shown below: 



Signal   Meaning 
0eA      I0 is valid, I0 can be executed in Pipe A 
1eA      I1 is valid, I1 can be executed in Pipe A 
0eB      I0 is valid, I0 can be executed in Pipe B 
1eB      I1 is valid, I1 can be executed in Pipe B 
1d0      I0, I1 are valid and I1 depends on I0: 
         - source of I1 = dest of I0, or 
         - I1 updates register file and 
           I0 post-modifies load/store pointer 


Based on the Instruction Analysis, instructions are issued to Pipes A and B. Three 
outcomes are possible for each Pipe: 
    IA (IB) <- I0 
    IA (IB) <- I1 
    IA (IB) <- NOP 

Data flow is effected by input multiplexers 260, 262, instruction registers 264, 266, 
output multiplexers 268, 270, and output registers 272, 274. An input multiplexer 276 and an 
output register 278 are associated with Valid register 220. 



Instruction select logic 250 controls output multiplexers 272, 274 according to Table 300 
in Fig. 3. Table 300 also specifies the update to the Valid register as a consequence of the issue 
decision. 

"V next" in Table 300 indicates that I0/I1 are updated with a new instruction pair 
(following cache load) and both are valid, unless a branch to an odd address occurs. In that case, 
I0 is not valid and I1 is valid. If two instructions are valid, but only one instruction issues, the 
signal "Stall" will be active to stall the instruction fetch by one cycle until the second instruction has 
issued. 
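For illustration, one issue policy that is consistent with the five analysis signals and the rules stated in the text (I0 is preferred, a dependent I1 waits, and fetch stalls while a valid I1 is held back) can be sketched as follows. This is not a reproduction of Table 300, which appears only in Fig. 3. The function name, the Python-legal argument names (e0A for the 0eA signal, and so on), and the assumption that a valid instruction is executable in at least one pipe are all hypothetical.

```python
def select_issue(e0A, e1A, e0B, e1B, d10):
    """Return (IA, IB, stall) with IA/IB each one of "I0", "I1", "NOP".

    e0A/e1A/e0B/e1B model the 0eA/1eA/0eB/1eB signals and d10 models the
    dependency signal. Validity is inferred from executability (assumption).
    """
    i0_valid = e0A or e0B
    i1_valid = e1A or e1B

    # Dual issue: both valid, no dependency, and the two fit in distinct pipes.
    if i0_valid and i1_valid and not d10:
        if e0A and e1B:
            return ("I0", "I1", False)
        if e0B and e1A:
            return ("I1", "I0", False)

    # Single issue: I0 has preference because it is first in program order.
    if i0_valid:
        ia, ib = ("I0", "NOP") if e0A else ("NOP", "I0")
        return (ia, ib, bool(i1_valid))  # stall fetch while I1 still waits

    # Only I1 valid, e.g. after a branch to an odd address.
    if i1_valid:
        ia, ib = ("I1", "NOP") if e1A else ("NOP", "I1")
        return (ia, ib, False)

    return ("NOP", "NOP", False)
```

Under this policy a load in I0 paired with an independent MAC in I1 issues in one cycle, while two instructions that both need Pipe A issue over two cycles with fetch stalled in between.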

The Order Register ORD, which lies in instruction memory and issue logic 130 (Fig. 1), 
indicates which instruction is first in program order. For example, if IA <- I1 and IB <- I0, then 
ORD <- "B" to indicate that the instruction in Pipe B is first in program order. ORD must be 
examined by later-stage control in two cases. First, if both instructions attempt to write the same 
general register, ORD will select the latter in program order. Second, if both instructions 
generate exceptions, ORD will select the address of the earlier instruction into the Exception PC 
(EPC) register. This logic is preferably executed during the D-Stage. IB_S_R, IA_S_R and 
ORD_S_R issue from registers on the rising edge of the S-Stage clock cycle, and are broadcast to 
other sections. 

To improve performance, the DSP's superscalar implementation can use a "sliding" 
two-instruction buffer with a third instruction register, I2. If only one instruction (I0) issues, the 
contents of I1 are shifted into I0, the contents of I2 are shifted into I1 and a new instruction is 
fetched into I2. As a result, a pair of instructions will always be available in I0 and I1 for 
potential dual issue, except for the cycle following a branch or jump. 

The third instruction buffer register I2 is required because another pair of unaligned 
instructions must be available for use, following the dual issue of an unaligned instruction pair 
(I0 and I1 addresses are not both on the same factor-of-8 boundary). If memory fetches only 
aligned instruction pairs, a third instruction buffer is required. 

The compression signal indicates whether code compression has been enabled. When 
compression is enabled, each 32-bit fetch is interpreted as a pair of 16-bit instructions. Upon 
fetching I0(32), the first 16-bit instruction of the pair will be loaded into I0 REG and the second 
into I1 REG. Similarly, when I1(32) is fetched, the first instruction of the pair will be loaded into 
I0 REG and the second into I1 REG. D-Stage logic follows the approach previously described 
with the shorter instruction (e.g., 16-bit) opcodes being analyzed for potential issue, then selected 
into IB REG and IA REG. The critical Register File read addresses for shorter instruction 
opcodes are resolved during the D-stage so that register file access for shorter instructions, as for 
longer (e.g., 32-bit) instructions, can begin on the rising edge of the S-Stage clock. The state of 
I0 valid and I1 invalid does not occur because there is no out-of-order execution. 

2. RALU Datapath 

Fig. 4 illustrates the Superscalar RALU datapath 400 consistent with this invention. 
Operations are divided between Pipe A and Pipe B in such a way that RALU 120 (Fig. 1) is the 
only major section of the processor which requires both Pipe A and B instructions. Coprocessor 
0 110, as well as the customer-defined coprocessors 1-3 (not shown), only require the Pipe A 
instruction. 



An 8-port (4-read/4-write) general register file 410 supports the dual execution pipelines. 
The 8-port architecture allows a large portion of both pipes to be fully replicated, simplifying 
design and improving the likelihood that two instructions can dual issue. 

As Fig. 4 shows, ALU A 420 and ALU B 430 connect to register file 410 through the 
respective read and write ports. In each pipe, one write port is dedicated to register file updates 
from the data bus (e.g., loads, or MFCz, CFCz - moves from a coprocessor). The remaining 
three ports (two read ports and one write port) are available for the other operations assigned to 
that Pipe. As a result, loads, including "twinword" loads of register pairs, can dual-issue with any 
MAC or ALU instruction, assuming no data dependency. The nomenclature "twinword" 
distinguishes these operations from "doubleword" operations, which (in other extensions) access 
a single 64-bit general register. The "twinword" load/store (load/store of pairs of 32-bit 
registers (e.g., r16, r17) in one 64-bit transfer from memory) allows the dual MACs to be kept 
busy at all times. 

All ALU operations are available in both Pipe A and Pipe B. DSP extensions to memory 
addressing, such as pointer post-modification and circular buffer addressing described below, are 
preferably unique to Pipe A. Also, coprocessor operations and all "sequencing control 
instructions" (branches, jumps) are unique to Pipe A. As a result, Pipe B instructions are not 
routed to the coprocessors. This is shown in Fig. 4 with ALU B being connected to Custom 
Engine Interface (CEI) 450 and dual MAC 440, which preferably resides in RALU 120 (Fig. 1). 
CEI 450 is optionally available for customer proprietary operations only in Pipe B. This feature 
allows the customer extensions to maintain high throughput since they can dual-issue with Load 
and Store instructions which issue to Pipe A. 

3. MMD register 

Fig. 5 shows MAC Mode ("MMD") register 500, which is preferably a new Radiax User 
register (24) that is accessed using Radiax User Move instructions MTRU and MFRU. MMD 
register 500 is read using the MFLXC0 instruction, a variant of the MFC0 instruction, and is 
loaded using an MTLXC0 instruction, a variant of the MTC0 instruction. Field 510 [31-5] are the 
mask bits for eight low-overhead interrupts described below. Field MF 520 selects arithmetic 
mode for multiplies in the Dual MAC. A "0" means use integer arithmetic mode, and a "1" means 
use fractional arithmetic mode. 

Field MS 530 selects the saturation boundary in the Dual MAC accumulators as follows. A 
"0" means saturate at 40 bits, and a "1" means saturate at 32 bits. Field MT 540 selects 
truncation of 32x32 bit multiplies in the Dual MAC. A "0" means perform full 32x32 bit multiply 
(sum all four partial products), and a "1" means omit partial product rX[15:00] x rY[15:00] when 
performing 32x32 bit multiply. 

Field RND 550 selects the rounding mode used in the RNDA2 instruction. A "00" means 
convergent rounding, and a "01" means round to nearest number. In convergent rounding, which 
is sometimes called "round-to-nearest-even," the numbers are rounded to the nearest number. 
When the number to be rounded is midway between two numbers representable in the smaller 
format, the number is rounded to the even number. In that case, the rounded result will always have 
0 in the lsb. If the lsb left of the roundoff point is random, convergent rounding is unbiased. 

In rounding to the nearest number, the number is rounded to the more positive number 
when the number to be rounded is midway between two numbers representable in the smaller 
format. This rounding mode is common because it is easily implemented by always adding 
0...0^10...0 to the number to be rounded. Digits to the right of "^" are dropped after rounding. 
Upon reset, all bits in the MMD register 500 are initialized to 0. 
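The two rounding modes differ only when the discarded digits lie exactly midway between two representable values. A minimal software sketch follows; the function names and the `drop` parameter (the number of low-order bits to discard) are hypothetical, and the exact bit position at which the RNDA2 instruction rounds is not assumed here.

```python
def round_nearest(value, drop):
    """Round to nearest, ties toward the more positive number.

    Implemented as described in the text: add a 1 just right of the
    roundoff point, then drop the low `drop` bits.
    """
    return (value + (1 << (drop - 1))) >> drop


def round_convergent(value, drop):
    """Round to nearest, ties to the even result (lsb of result is 0)."""
    half = 1 << (drop - 1)
    mask = (1 << drop) - 1
    rounded = (value + half) >> drop
    if (value & mask) == half:
        # Exactly midway: force the lsb of the result to 0.
        rounded &= ~1
    return rounded
```

With `drop = 16`, the value 0x28000 (2.5 in units of 0x10000) rounds to 3 under round-to-nearest and to 2 under convergent rounding, while 0x18000 (1.5) rounds to 2 in both modes.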

4. MAC datapath 

Fig. 6 illustrates a Dual MAC datapath 600 consistent with this invention. The major 
subsystems are two 16-bit multiply-accumulate datapaths 610 and 620, each with a temporary 
register 612 and 622, respectively, and divide unit 630. Each multiplier 610 and 620 feeds a 
corresponding 32-bit product register 614 and 624, two accumulator units 616 and 626, four 
40-bit accumulator registers 618 and 628, and output scalers 619 and 629. A 40-bit 
Add/Subtract/Dual Round Unit 650 provides optional saturation and overflow for the two 
accumulators 616 and 626. 

Multiply-accumulate datapaths can operate on 16-bit input data, either individually or in 
parallel. Preferably, the same assembler mnemonic is used for individual or parallel operation. 
The output register specified determines whether MAC0, MAC1, or both operate. For 
example, 

    MADDA2 m0l, r2, r3   // MAC0: m0l <- m0l + r2[15:00] * r3[15:00] 
                         // MAC1: IDLE 
    MADDA2 m0h, r2, r3   // MAC0: IDLE 
                         // MAC1: m0h <- m0h + r2[31:16] * r3[31:16] 
    MADDA2 m0, r2, r3    // MAC0: m0l <- m0l + r2[15:00] * r3[15:00] 
                         // MAC1: m0h <- m0h + r2[31:16] * r3[31:16] 

Each multiplier 610 and 620 can initiate a new 16 x 16-bit product every cycle (single 
cycle throughput). Each 16 x 16-bit multiply-accumulate preferably completes in three cycles. 
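The effect of the output register choice in the MADDA2 examples can be modeled as below. This is a sketch under stated assumptions: integer arithmetic mode, signed 16-bit operands, no saturation (results wrap at 40 bits), and a dictionary standing in for the accumulator pair. The helper and function names are hypothetical.

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def wrap40(x):
    """Wrap x to a signed 40-bit accumulator value (no saturation)."""
    x &= (1 << 40) - 1
    return x - (1 << 40) if x & (1 << 39) else x


def madda2(acc, target, r2, r3):
    """Model MADDA2 for target "m0l", "m0h", or "m0" (both halves)."""
    out = dict(acc)
    if target in ("m0l", "m0"):   # MAC0 uses the low halfwords
        out["m0l"] = wrap40(acc["m0l"] + s16(r2) * s16(r3))
    if target in ("m0h", "m0"):   # MAC1 uses the high halfwords
        out["m0h"] = wrap40(acc["m0h"] + s16(r2 >> 16) * s16(r3 >> 16))
    return out


# r2 holds (3, -2) and r3 holds (4, 5) as (high, low) halfwords.
result = madda2({"m0l": 0, "m0h": 0}, "m0", 0x0003FFFE, 0x00040005)
```

Here `result` holds 12 in the high accumulator and -10 in the low accumulator, one product from each multiplier.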

Temporary registers 612 and 622 and product registers 614 and 624 show that multipliers 
610 and 620 require two cycles, but have single cycle throughput. Temporary registers 612 and 
622, and product registers 614 and 624, however, are preferably not accessible by the 
programmer. Thus, there are two delay slots for multiplication or multiply-accumulate. For 
example, 

    Inst 1: MADDA2 m1h, r2, r3 
    Inst 2: delay slot 1      // new m1h is not available 
    Inst 3: delay slot 2      // new m1h is not available 
    Inst 4: MFA r3, m1h       // new m1h is available 

m1h can be referenced by MFA in Inst 2 (Inst 3), but two (one) stall cycles will be 
incurred. The number of stall cycles in DSP algorithms is expected to be minimal because 
many products are often accumulated before the accumulator output must be stored. In a 64-tap 
FIR, for example, sixty-four terms are accumulated before the filter sample is updated in 
memory. Also, the four accumulator pairs allow loops to be "unrolled" so that up to three 
additional independent MAC operations can be initiated before the result of the first is available. 
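The delay-slot rule above reduces to a simple count. The sketch below assumes the three-cycle multiply-accumulate latency stated in the text; the function name and its argument are hypothetical.

```python
MAC_LATENCY = 3  # cycles for a 16 x 16-bit multiply-accumulate, per the text


def mfa_stall_cycles(instructions_between):
    """Stall cycles for an MFA reading an accumulator written by a MADDA2.

    `instructions_between` counts the instructions issued between the
    MADDA2 and the MFA (0 means the MFA is the very next instruction).
    """
    delay_slots = MAC_LATENCY - 1  # two delay slots
    return max(0, delay_slots - instructions_between)
```

An MFA placed immediately after the MADDA2 stalls for two cycles, one placed in the second delay slot stalls for one, and one placed after both delay slots does not stall.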

Compared to a typical RISC multiply-accumulate unit, a MAC consistent with this 
invention includes a number of features critical to high-fidelity DSP arithmetic, such as 
accumulator guard bits, fractional arithmetic, saturation, rounding, and output scaling. These 
features are optionally selected by opcode and/or mode bits in MMD register 500, and are 
compatible with conventional integer arithmetic. 

Figs. 7A, 7B, and 7C show some of the data arithmetic modes consistent with the present 
invention. These modes are used to select between several available options for the following 
features of the MAC's arithmetic: (i) truncation mode, (ii) saturation mode, (iii) fractional/integer 
arithmetic mode, and (iv) rounding mode. These modes and their optional settings are discussed 
below. 

Accumulation is preferably performed at 40-bit precision, using eight guard bits for 
overflow protection. The alternative is to require the programmer to right-shift (scale) products 
before accumulation, which complicates programming and causes loss of precision. Before 
accumulation, the product is sign-extended to 40-bits. With guard bits, the only loss of precision 
will typically occur at the end of a lengthy calculation when the 40-bit result must be stored to 
the general register file or to memory in 32-bit or 16-bit format. 

Fractional arithmetic is implemented by the program's interpretation of the 16-, 32- or 
40-bit quantities and is controlled by a bit in the MMD register 500. When fractional mode is 
selected, the dual MAC shifts the results of any multiply operation left by one bit to maintain the 
alignment of the implied radix point. Furthermore, since (-1) can be represented in fractional 
format but +1 cannot, in fractional mode the dual MAC detects when both operands of a multiply 
are equal to (-1). When this occurs, it generates the approximate product consisting of 0 for the 
sign bit (representing a positive result) and all ones for the remaining bits. This is true for 
both 16x16-bit and 32x32-bit multiplications. The least significant bit of a product is always zero 
in fractional mode (due to the left shift). 
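A software model of the 16x16-bit fractional multiply may make the one-bit shift and the (-1) x (-1) case concrete. The sketch assumes 16-bit two's-complement fractions producing a 32-bit fractional product (commonly called Q15 and Q31, terms not used in this description); the function names are hypothetical.

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def frac_mul16(a, b):
    """Fractional-mode 16 x 16-bit product as a signed 32-bit value."""
    a, b = s16(a), s16(b)
    if a == -0x8000 and b == -0x8000:
        # (-1) x (-1): +1 is not representable, so return sign bit 0
        # followed by all ones.
        return 0x7FFFFFFF
    # Left shift by one keeps the implied radix point aligned.
    return (a * b) << 1
```

For example, 0x4000 (0.5) times 0x4000 gives 0x20000000 (0.25), and every product other than the special case has a zero least significant bit.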

The accumulation units can add the product to, or subtract it from, one of the four 
accumulator registers 618 and 628. This operation can be performed with optional saturation; 
that is, if a result overflows (underflows), the accumulator is updated with the largest positive 
(most negative) number rather than the "wraparound" result with incorrect sign. The DSP 
instructions preferably include a multiply-add and multiply-sub instruction, each with and 
without saturation. There are also instructions for adding or subtracting any pair of 40-bit 
accumulator registers together, with and without saturation. A bit in the MMD register 
determines whether the saturation is performed on the full 40-bits or whether saturation is 
performed at 32 bits. The latter capability is useful for emulating the results of other 
architectures that do not have guard bits. 
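The choice of saturation boundary can be illustrated with a short model. The `ms` argument follows the MS field convention given earlier (0 saturates at 40 bits, 1 at 32 bits); the function names are hypothetical, and values are held as ordinary Python integers rather than fixed-width registers.

```python
def saturate(value, bits):
    """Clamp value to the range of a signed two's-complement `bits`-bit number."""
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, value))


def acc_add_sat(acc, product, ms):
    """Add a product to an accumulator with saturation.

    ms = 0: saturate at 40 bits (eight guard bits absorb the overflow).
    ms = 1: saturate at 32 bits (emulates a machine without guard bits).
    """
    bits = 32 if ms else 40
    return saturate(acc + product, bits)
```

Adding 1 to 0x7FFFFFFF yields 0x80000000 when saturating at 40 bits, because the guard bits hold the carry, but stays at 0x7FFFFFFF when saturating at 32 bits.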

When the instruction requires multiplication, but no accumulation, the product is passed 
through the accumulation unit unchanged. (Thus, both 16-bit multiplication and 
multiply-accumulate require three MAC cycles.) 

A Round instruction can also be executed on one (or a pair) of the accumulator registers 
to reduce precision prior to storage. The rounding mode is selectable in MMD register 500. 
The output scalers 619 and 629 are used to right shift (scale) the accumulator register when it is 
transferred to the general register file. 

The dual MAC configuration consistent with the present invention is also used to execute 
the 32-bit MULT(U) and DIV(U) instructions specified in, for example, the MIPS ISA. In the 
case of MULT(U), one of the 16-bit Multiply-Accumulate datapaths works iteratively to produce 
the 64-bit product in five cycles. The least significant 32 bits are available one cycle earlier than 
the most significant 32 bits. MMD register 500 mode bits have no effect on the operation of the 
standard MIPS ISA instructions. By contrast, the MULTA instruction is subject to MMD 
register 500 mode bits for fractional arithmetic and truncated 32x32-bit multiplication. 
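The effect of the truncation option on a 32x32-bit multiply can be seen by writing the product as the sum of four 16x16-bit partial products. The sketch below is arithmetic only: it assumes signed operands in integer mode, does not model the order or timing in which the hardware forms the partial products, and uses hypothetical names. The `mt` argument follows the MT field convention given earlier.

```python
def s32(x):
    """Interpret the low 32 bits of x as a signed value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def mul32(x, y, mt=0):
    """Signed 32 x 32-bit product built from 16 x 16-bit partial products.

    mt = 0: sum all four partial products (exact 64-bit result).
    mt = 1: omit the low x low partial product rX[15:00] x rY[15:00].
    """
    xh, xl = s32(x) >> 16, x & 0xFFFF   # signed high half, unsigned low half
    yh, yl = s32(y) >> 16, y & 0xFFFF
    total = ((xh * yh) << 32) + ((xh * yl + xl * yh) << 16)
    if not mt:
        total += xl * yl
    return total
```

The truncated result differs from the exact one by the omitted low-order partial product, which is always less than 2^32 and therefore affects only the low half of the 64-bit result plus at most one carry into the high half.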

For the DSP MULTA, an accumulator pair M0h[31:0]/M0l[31:0], M1h[31:0]/M1l[31:0], 
etc. is the target. M0h[31:0] is aliased to HI; M0l[31:0] is aliased to LO. Unlike the (dual) 
16-bit operations, single-cycle throughput is not available for 32-bit data. Because there are two 
available data paths, however, two 32x32-bit multiply operations can be initiated every four 
cycles. The Dual MAC hardware configuration consistent with this invention automatically 
allocates the second operation to the available data path. If a third 32-bit multiplication is 
programmed too soon, stall cycles are inserted until one of the data paths is free. 

The Dual MAC configuration consistent with this invention also supports a complex 
multiply instruction, CMULTA. For this instruction, each of the 32-bit general register operands 
is considered to represent a 16-bit real part (in bits 31:16) and a 16-bit imaginary part (in bits 
15:00). One of the multipliers calculates the real part (32 bits) of the complex product (namely 
XrYr - XiYi) and stores it in the "h" half of the target accumulator pair. The other multiplier 
calculates the imaginary part (32 bits) of the complex product (namely XrYi + XiYr) and stores it 
in the "l" half of the target accumulator pair. This instruction can be initiated every two cycles 
(2-cycle throughput) and takes four cycles to complete. As in the other Dual MAC operations, 
programming CMULTA instructions too close together causes stall cycles but the correct results 
are always obtained. 
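The arithmetic performed by CMULTA can be sketched as follows, assuming integer mode, signed 16-bit parts, and no accumulation. The function name is hypothetical; the first value returned corresponds to the real part stored in the "h" half of the target accumulator pair, and the second to the imaginary part stored in the other half.

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def cmulta(rx, ry):
    """Complex product of two packed operands (real in 31:16, imag in 15:00)."""
    xr, xi = s16(rx >> 16), s16(rx)
    yr, yi = s16(ry >> 16), s16(ry)
    real = xr * yr - xi * yi   # computed by one multiplier
    imag = xr * yi + xi * yr   # computed by the other multiplier
    return real, imag


# (3 + 4i) * (1 - 2i) = 11 - 2i
product = cmulta((3 << 16) | 4, (1 << 16) | (-2 & 0xFFFF))
```

Each part needs two 16x16-bit products, which is consistent with the two-cycle throughput stated for this instruction.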

The Dual MAC configuration consistent with this invention includes a separate Divide 
Unit 630 for executing the 32-bit DIV(U) operations specified by the MIPS ISA. The Divide 
requires 19 cycles to complete. The quotient is loaded into M0l[31:0], M1l[31:0], M2l[31:0], or 
M3l[31:0], and the remainder is loaded into the lower 32-bits of the other accumulator in the 
target pair. There is no special support for fractional arithmetic for the DIV operations. 

Because the Dual MAC configuration consistent with this invention is capable of 
consuming four 16-bit operands every cycle (in Pipe B) by performing two 16x16 bit multiply- 
accumulates, it is desirable to be able to fetch four 16-bit operands from memory every cycle (in 
Pipe A). Therefore, the DSP extends the MIPS load and store instructions to include twinword 
accesses and implements a 64-bit data path from memory. A twinword memory operation 
accesses an (even-odd) pair of 32-bit general registers with a single instruction and executes in a 
single pipeline cycle. 

Like the standard byte, halfword, and word load/store instructions, the twinword 
load/store instructions use a register and an immediate field to specify the memory address. 
However, to fit into the format, the available signed 11-bit immediate field (called the 
displacement) is considered a twinword quantity, so is left-shifted by 3 bits before being added to 
the base register. This is equivalent to a 14-bit byte offset, in comparison to the full 16-bit 
immediate byte offset used in the byte, halfword and word instructions. Also, the target register 
pair for the twinword load/store must be an even-odd pair, so that only 4 bits are used to specify 
it. 
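
The twinword address calculation described above can be sketched as follows. This is a minimal illustration under the stated field widths; the function names are hypothetical, and 32-bit wraparound of the address is an assumption.

```python
def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value to a Python integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def twinword_address(base, disp11):
    """Effective byte address: base + (signed 11-bit displacement << 3)."""
    byte_offset = sign_extend(disp11, 11) << 3  # twinword quantity scaled to bytes
    return (base + byte_offset) & 0xFFFFFFFF    # assumed 32-bit wraparound
```

The scaled displacement spans -8192 to +8184 bytes, which is the 14-bit byte offset range mentioned above.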

DSP algorithms usually operate on vectors or matrices of data; for example Discrete 
Cosine Transforms operate on 8x8 pixel blocks. As a result, data memory pointers are 
incremented from one operand to the next. The extra instruction cycle required to increment 
RISC memory pointers is eliminated in DSPs with auto-increment. Memory pointers are used 
unmodified to create the address, then updated in the general register file before the next use: 

address <- pointer 

pointer <- pointer + stride 

The 8-bit immediate field containing the stride is sign-extended to 32-bits before being added to 
the pointer for the latter's update. The nomenclature "pointer" distinguishes the update 
performed after memory addressing, as opposed to the "base" register (in the MIPS ISA), which 
is augmented by the offset before addressing memory in the standard instructions. The 
nomenclature "stride," which depends on the granularity of the access, distinguishes it from the 
invariant byte offset used in the standard load and store instructions. For twinword/word/ 
halfword addressing the 8-bit field is first left-shifted by three/two/one places and zero-filled, 
before sign extension to 32-bits. This use of left shifts for the twinword, word, and halfword 
strides is similar to the MIPS16 ASE and is used to extend the effective address range. Thus, 
increments of between -128 and +127 twinwords, words, halfwords or bytes are available for 
each data type. 
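
The post-modified addressing described above can be sketched as follows. The size names, the function names, and the 32-bit wraparound of the pointer are assumptions made for illustration only.

```python
SHIFT_BY_SIZE = {"byte": 0, "halfword": 1, "word": 2, "twinword": 3}


def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value to a Python integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def scaled_stride(stride8, size):
    """Left-shift the 8-bit field by the access granularity, then sign-extend."""
    shift = SHIFT_BY_SIZE[size]
    return sign_extend((stride8 & 0xFF) << shift, 8 + shift)


def post_modified_access(pointer, stride8, size):
    """Return (address, updated pointer); the address is the unmodified pointer."""
    address = pointer
    new_pointer = (pointer + scaled_stride(stride8, size)) & 0xFFFFFFFF
    return address, new_pointer
```

A field value of 0xFF with word granularity, for example, steps the pointer back by one word (4 bytes) after the access.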

In the case of Loads (but not Stores) pointer update requires a second general register file 
write port. An 8-port (4 read/4 write) register file has two of the four write ports dedicated to 
register Loads. As a result, twinword loads can execute in parallel with any Pipe B operation. 
Figs. 8A, 8B, 8C, and 8D contain a Table 800 of the instructions supported by an implementation 
consistent with this invention. Fig. 9 contains a Table 900 showing the assignment of 
instructions to Pipe A and Pipe B. 

5. Circular buffers 

For some DSP algorithms, notably filters, DSP data is organized into "circular buffers." 
In this case, the next reference after the end of the buffer is to the beginning of the buffer. 
Implementing this structure in RISC requires: 



Inst 1:    LW      reg, AddressReg 
Inst 2:    BNEL    AddressReg, BufferEnd, Continue 
Inst 3:    ADDIU   AddressReg, AddressReg + 4 
Inst 4:    MOVE    AddressReg, BufferStart 
Continue: 

The above example is written so that a branch prediction failure will only be incurred at 
the end of the buffer. Nevertheless, the combination of post-modified pointers together with 
hardware support for circular buffers allows reducing this typical DSP addressing operation from 
four cycles to one. 

Fig. 10 is a block diagram illustrating circular buffer start registers (cbs0 1010, cbs1 
1013, and cbs2 1016) and circular buffer end registers (cbe0 1020, cbe1 1023, and cbe2 1026) 
preferably located in ALU A 420 (Fig. 4). The circular buffer control logic is quite sophisticated 
in dealing with different data widths and decrement as well as increment of the circular buffer 
pointer. 

The DSP supports three circular buffers. To initialize the circular buffers, MTALU 
(Move To ALU) instructions are used to set the twinword start addresses cbs0 1010, cbs1 1013, 
and cbs2 1016 [31:3] and twinword end addresses cbe0 1020, cbe1 1023, and cbe2 1026 
[31:3]. Circular buffers are only used when memory pointers are post-modified, and consist of 
an integral number of twinwords. 

When a circular buffer pointer is used in a post-modified address calculation, the pointer 
is compared to the associated cbe address. If they match (and the stride is non-negative), the cbs 
address (rather than the post-modified address) is restored to the register file. Similarly, to allow 
for traversing the circular buffer in the reverse direction, the pointer is compared to the cbs 
address; if they match (and the stride is negative) the cbe address (rather than the post-modified 
address) is restored to the register file. 
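
The pointer update described above can be sketched for the twinword case. The sketch assumes twinword-aligned byte addresses, a stride already scaled to bytes, and that the end address names the last twinword of the buffer; the function name is hypothetical.

```python
def circular_post_modify(pointer, stride, cbs, cbe):
    """Return (address, updated pointer) for a post-modified circular access."""
    address = pointer                      # memory is addressed with the unmodified pointer
    if stride >= 0 and pointer == cbe:
        new_pointer = cbs                  # wrap forward: end reached, restore start
    elif stride < 0 and pointer == cbs:
        new_pointer = cbe                  # wrap backward: start reached, restore end
    else:
        new_pointer = (pointer + stride) & 0xFFFFFFFF
    return address, new_pointer
```

With a four-twinword buffer from 0x100 to 0x118 and a stride of 8 bytes, successive accesses address 0x100, 0x108, 0x110, 0x118 and then 0x100 again.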

Also in Fig. 10 are AREG 1030, which holds the memory pointer, BIREG 1035, which 
holds the "stride" by which the memory pointer is modified, BRREG 1040, which holds the data 
to be stored in the case of store instructions, and TEMP 1045, which holds a pipeline stage 
register DADDR_E 1050. The contents of AREG 1030 are transferred to DADDR_E 1050 as 
the memory address. 

A+BI 1060 is the memory pointer modified by the stride; Select signal 1064 indicates 
whether the memory pointer matches the circular buffer end address; and ALUREG 1070 is the 
output of the ALU. In this case, ALUREG 1070 holds the modified pointer that will be used in 
the next such memory address. 

Circular buffers can also be accessed with byte, halfword, or word Load/Store with 
Pointer Increment instructions. In those cases, the several least significant bits of the pointer 
register are examined to determine if the start or end of the buffer has been reached, taking into 
account the granularity of the access, before replacing the pointer with the cbs or cbe as 
appropriate. 

Any general register memory pointer can be used with circular buffers using the ".Cn" 
option. To use general register rP as a circular buffer pointer, for example, the instruction 

LWP.C2 r3, (r4)stride 

associates the r4 memory pointer with circular buffer C2 defined by the start address cbs2 1016 
and end address cbe2 1026. Figs. 11A, 11B, and 11C contain a Table 1100 summarizing vector 
addressing instructions with the implementation described above. 

6. Instruction extensions 

The DSP accommodates extensions to the standard instruction sets, such as the MIPS 
instruction set, to support dual 16-bit operations, and also introduces a number of additional ALU 
instructions that improve performance on DSP algorithms. Supporting high-performance, dual 
16-bit operations in the RISC-DSP requires supporting not only dual MAC instructions but also 
dual 16-bit versions of other arithmetic operations that the programmer may require. 

In general, 16-bit immediates are not supported because the DSP extensions are encoded 
using the INST[05:00] "subop" field. Although dual 16-bit versions of logical operations such as 
AND may not be required, dual 16-bit versions have been provided for all 3-register operand 
shifts and add/subtracts. In a preferred implementation, the character "2" in the assembler 
mnemonic indicates an operation on dual 16-bit data. 

DSP algorithms are often somewhat tolerant of data errors. A bad audio sample, for 
example, may cause a brief distortion, but no lasting effect as new audio samples arrive and the 
bad sample is cleared out of the buffer. Accordingly, the saturated result of signed arithmetic is a 
closer, more desirable, approximation than the wraparound result. Therefore, all arithmetic 
operations that may potentially produce arithmetic overflow or underflow, and do not have 
immediate operands, support optional saturation. For example, not only the dual 16-bit add 
(ADDR2), but also the 32-bit add (ADDR) have optional saturate. Neither the dual 16-bit 
instructions nor the 32-bit saturating adds and subtracts cause exceptions. 
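
The difference between wraparound and saturated results can be illustrated with a short sketch. It models only the arithmetic, not the instruction encodings; the function names and the `saturate` flag are hypothetical.

```python
def wrap_signed(value, bits):
    """2's complement wraparound of value to `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def saturate_signed(value, bits):
    """Clamp value to the signed range representable in `bits` bits."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return max(low, min(high, value))


def add_signed(a, b, bits=32, saturate=False):
    """Signed add with either wraparound or saturation on overflow."""
    total = a + b
    return saturate_signed(total, bits) if saturate else wrap_signed(total, bits)


def add_dual16(x_word, y_word, saturate=False):
    """Add the two 16-bit halves of each 32-bit word independently."""
    result = 0
    for shift in (16, 0):
        a = wrap_signed(x_word >> shift, 16)
        b = wrap_signed(y_word >> shift, 16)
        result |= (add_signed(a, b, 16, saturate) & 0xFFFF) << shift
    return result
```

Adding 1 to the largest positive 16-bit value wraps to the most negative value without saturation, but stays at 0x7FFF with saturation, which is the closer approximation.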

A DSP consistent with this invention includes several additional ALU instructions for 
DSP performance analysis. Consistent with the approach described above, each additional 
instruction has both a 32-bit and a dual 16-bit version. If signed overflow/underflow is possible, 
a saturation option is provided. These instructions are described in Table 1200 in Figs. 12A and 
12B. 

Some DSPs provide a more extensive set of microcontrol functions; for example, field 
"set," "clear," and "extract" functions. In ATM cell processing, for example, these functions are 
useful in processing the cell headers. In DSPs consistent with systems and methods of the 
present invention, these operations can readily be executed using one or more RISC operations. 
For example: 

and r3, r2, r1 (mask)    // clears a field specified in the mask 
or r3, r2, r1 (mask)     // sets a field specified in the mask 
sll r3, r3, n            // extracts the field between [31 - n : 31 - (n+m)] 
srl(sra) r3, r3, m       // the field is right-justified and 0 (sign) extended if 
                         // srl (sra) is selected. 
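
The same field operations can be sketched on 32-bit values. The sketch assumes the mask register holds zeros in the field to be cleared (for the AND) or ones in the field to be set (for the OR); the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def clear_field(value, mask):
    """AND with a mask that has zeros in the field to be cleared."""
    return value & mask & MASK32


def set_field(value, mask):
    """OR with a mask that has ones in the field to be set."""
    return (value | mask) & MASK32


def extract_field(value, n, m, signed=False):
    """Shift left by n to discard the upper bits, then shift right by m."""
    shifted = (value << n) & MASK32
    if signed and shifted & 0x80000000:
        shifted -= 1 << 32          # reinterpret as negative so >> behaves like sra
    return shifted >> m             # srl when unsigned, sra when signed
```

For instance, extracting bits 15:8 of 0x12345678 uses n = 16 and m = 24 and yields 0x56.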

Figs. 13A and 13B contain Table 1300 that lists and describes additional ALU operations. 
Several instructions allow conditional execution of Conditional Move (CMV<COND>). 

A number of DSPs and RISC processors have deployed extensive conditional execution. 
In these processors, the branch prediction penalty is three cycles or more. Conditional execution 
can mitigate the effect of the branch prediction penalty by allowing bypassing of the branch in 
some cases. Conditional execution is a costly alternative, however, because it uses instruction 
opcode bits and consequently limits the size of immediates and/or limits the number of general 
purpose registers visible to the program. The branch prediction penalty in the DSP of the present 
invention, however, only requires two cycles; therefore the value of conditional execution is 
minimized and only a restricted set of "conditional move" instructions is needed. The effect of 
any conditional execution can be "emulated," however, with a sequence of two instructions by 
using the conditional move. For example: 

Processor with conditional execution: 

Inst 1: ALU operation sets condition flags 
Inst 2: COND: ALU operation 

In the DSP consistent with this invention: 

Inst 1: ALU operation updates register rB (condition setting operation) 
Inst 2: ALU operation with result directed to temp register rA 
Inst 3: CMV<COND> rD, rA, rB 

If rB satisfies the COND, rD is updated with rA; i.e. the 2nd ALU operation is executed to 
"completion." This sequence is interruptible. The if-then-else construct: 

if (rB COND) 
    rD = rA 
else 
    rD = rC 

can be coded if the previous example is prefaced with: 

MOVE rD, rC // move rC to rD 
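
The emulation of if-then-else with a conditional move can be sketched as follows. The condition is modeled as a hypothetical Python predicate on rB; the function names are illustrative only.

```python
def cmv(condition, rd, ra, rb):
    """CMV<COND> rD, rA, rB: rD takes rA when rB satisfies the condition."""
    return ra if condition(rb) else rd


def if_then_else(condition, ra, rb, rc):
    """Two-step sequence: unconditional MOVE followed by the conditional move."""
    rd = rc                          # MOVE rD, rC
    rd = cmv(condition, rd, ra, rb)  # CMV<COND> rD, rA, rB
    return rd
```

With a "equal to zero" predicate, `if_then_else` returns rA when rB is zero and rC otherwise.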

This conditional move facilitates initial porting of assembler code from processors with 
conditional execution. Table 1400 in Fig. 14 lists conditional operations consistent with this 
invention. 

7. Zero Overhead Loop Facility 

DSP algorithms spend much of their time in short real-time critical code loops. To 
compensate, DSPs often include hardware support for "zero-overhead looping." Zero-overhead 
looping allows branching from the end to the beginning of the loop to be accomplished without 
explicit program overhead if the loop is to be executed a fixed number of times, known at 
compile time. Typically, loop counters and program start and/or end address registers are 
required. Zero-overhead looping cannot be nested without additional hardware. Often, zero-overhead 
loops cannot be interrupted. 

A DSP hardware configuration consistent with the disclosed embodiment supplies such a 
facility but allows the loop count to be determined at run time as well. The facility consists of 
three Radiax registers, which are accessible by a program running in User mode using Radiax 
instructions MFRU and MTRU. The operating system should consider these registers as part of 
the context of the executing process and should save and restore them in the case of an interrupt. 
These three registers include: 

LPE0[31:2] maintaining a virtual address of the ending instruction of the loop; 

LPS0[28:2] holding low order bits of the virtual address of the "starting" instruction of 
the loop; and 

LPC0[15:0] holding the loop count value. 

Although this feature is intended to operate in loops, the algorithm executed by the 
hardware can be described more simply. In particular, it should be noted that there is no 
knowledge of being "inside" the loop. All that matters is the contents of the three registers when 
an attempt is made to execute the instruction at the address specified by LPE0: 

If (M32-mode, AND current-instruct-addr[31:2] = LPE0, AND LPC0 ≠ 0) then 

    execute current instruction (at LPE0[31:2] || 00), 
    decrement LPC0[15:0] by one, 
    execute instruction at LPE0[31:29] || LPS0[28:2] || 00 
    continue (LPS0 could be a jump/branch) 

Else 

    execute current instruction, 
    continue (current instruction could be a jump/branch) 
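
The algorithm above can be sketched as a program-counter trace. The sketch assumes M32 mode, straight-line code with no jumps or branches, and registers held as byte addresses; the dictionary keys and function names are hypothetical.

```python
def next_pc(pc, regs):
    """Address of the instruction that follows the one at pc."""
    if (pc >> 2) == (regs["LPE0"] >> 2) and regs["LPC0"] != 0:
        regs["LPC0"] = (regs["LPC0"] - 1) & 0xFFFF
        # LPE0[31:29] || LPS0[28:2] || 00
        return (regs["LPE0"] & 0xE0000000) | (regs["LPS0"] & 0x1FFFFFFC)
    return (pc + 4) & 0xFFFFFFFF


def trace(start, stop, regs):
    """List the addresses executed from start until stop is reached."""
    executed = []
    pc = start
    while pc != stop:
        executed.append(pc)
        pc = next_pc(pc, regs)
    return executed
```

Under this literal reading, a loop entered from the top with LPC0 = 2 executes its body three times: two passes branch back from the LPE0 address and the third falls through when LPC0 has reached zero.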

In this embodiment, the order of loading the registers should be LPS, then LPE, then 
LPC with a non-zero value. Further, there should be at least two (2) instructions between the 
instruction that loads LPC with a non-zero value, and the instruction at the LPE address. To 
guarantee that no stall cycles are incurred, at least six (6) instructions should separate the 
instruction that loads LPC with a non-zero value, and the instruction at the LPE address. 

8. Cycle-by-cycle usage for Dual MAC instructions 

As explained above, a Dual MAC architecture eliminates programming hazards for its 
instructions by stalling the pipeline when necessary. It does this both to avoid resource conflicts 
and to wait for results of a first instruction to be ready before attempting to use those results in a 
second instruction. This means that no programming restrictions are needed to obtain correct 
results from a sequence of Dual MAC instructions. 

The most efficient use of the hardware, however, occurs when the program avoids these 
stalls, usually by properly scheduling the instructions. Table 1500 in Fig. 15 lists the cycles 
required between instructions to assist in such scheduling. Specifically, Table 1500 indicates the 
number of cycles that must be inserted between the first indicated instruction and the second 
indicated instruction. The number "0" means that the two instructions can be issued back-to- 
back. If the instructions are issued back to back, the Dual MAC architecture will stall the 
pipeline for the indicated number of cycles. Non-Dual MAC instructions can be issued to occupy 
those cycles, or other appropriate Dual MAC instructions can be issued in those cycles, such as 
those using non-conflicting accumulators. 

The following code sequences indicate the most efficient use of the Dual MAC for coding 
the inner loop of some common DSP algorithms. The algorithms are presented for 16-bit 
operands with 16-bit results, as well as 32-bit operands with 32-bit results. The algorithms 
assume that fractional arithmetic is used. Therefore, for the 32-bit results of a 32x32 multiply, 
only the HI half of the target accumulator pair is retrieved or used. 

In these examples, only the Dual MAC instructions are shown. The other pipe is used to 
fetch and store operands and take care of loop housekeeping functions. The loops may need to 
be unrolled to take full advantage of the multiple Dual MAC accumulators. 

Case 1: 16-bit inner product SUM = SUM + Ai*Bi 

Assuming packed operands, two multiply-adds per cycle: 

MADDA2 m0, r1, r2 
MADDA2 m0, r3, r4 
MADDA2 m0, r5, r6 
MADDA2 m0, r7, r8 
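
The effect of the MADDA2 sequence on an accumulator pair can be sketched as follows. The sketch assumes integer (non-fractional) mode without saturation and 40-bit accumulators; combining the two halves with a final addition is an assumption, since the sequence above shows only the Dual MAC instructions. The function names are hypothetical.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a signed 2's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def wrap40(value):
    """2's complement wraparound to a 40-bit accumulator."""
    value &= (1 << 40) - 1
    return value - (1 << 40) if value >> 39 else value


def madda2(acc_h, acc_l, rs, rt):
    """Accumulate the high-half product into acc_h and the low-half product into acc_l."""
    acc_h = wrap40(acc_h + to_signed16(rs >> 16) * to_signed16(rt >> 16))
    acc_l = wrap40(acc_l + to_signed16(rs) * to_signed16(rt))
    return acc_h, acc_l


def inner_product(a_words, b_words):
    """Inner product of two vectors packed two 16-bit elements per 32-bit word."""
    acc_h, acc_l = 0, 0
    for rs, rt in zip(a_words, b_words):
        acc_h, acc_l = madda2(acc_h, acc_l, rs, rt)
    return acc_h + acc_l  # assumed final combination of the two partial sums
```

For A = (1, 2, 3, 4) and B = (5, 6, 7, 8), the "h" half accumulates 1*5 + 3*7 = 26 and the "l" half accumulates 2*6 + 4*8 = 44, giving 70.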

Case 2: 16-bit vector product loop. Ci = Ai*Bi 

Assuming packed fractional operands, two multiplies per two cycles using two 
accumulator pairs. 

MULTA2 m0, r1, r2 
MFA2 m1, r8 
MULTA2 m1, r3, r4 
MFA2 m0, r7 

Case 3: 16-bit complex vector product Ci = Ai * Bi (complex multiply) 

Assuming fractional operands packed as 16-bit real, 16-bit imaginary. One complex 
multiply every two cycles using two accumulator pairs. 

CMULTA m0, r1, r2 
MFA2 m1, r8 
CMULTA m1, r3, r4 
MFA2 m0, r7 



Case 4: 32-bit inner product loop: SUM = SUM + Ai*Bi 

Assuming a multiply-add every other cycle using one accumulator. 

MADDA m0, r1, r2 
non-DualMAC op 
MADDA m1, r3, r4 
non-DualMAC op 

Case 5: 32-bit vector product loop. Ci = Ai*Bi 

Assuming fractional 32-bit operands so that the MFA waits for the HI result of the 
MULTA. Achieves one multiply per two cycles using all the accumulators. 



MULTA   m0, r1, r2 
MFA     r9, m1h 

MULTA   m1, r3, r4 
MFA     r10, m2h 

MULTA   m2, r5, r6 
MFA     r11, m3h 

MULTA   m3, r7, r8 
MFA     r12, m0h 

Case 6: 32-bit complex vector product. Ci = Ai * Bi (complex multiply) 

Assuming fractional 32-bit operands so that the ADDMA/SUBMA waits for the HI result 
of the second MULTA. Achieves one complex multiply per ten cycles using all the 
accumulators, with two inserted instructions. This is a good example of the cycles needed from 
MULTA to SUBMA/ADDMA (5 cycles for HI) and from SUBMA/ADDMA to MFA (2 cycles). 

MULTA   m0, r1, r4        ; m[2i] * b[2i+1] 
MFA     rimag, a1h 
MULTA   m1, r2, r3        ; m[2i+1] * b[2i] 
SUBMA   m3h, m2h, a3h     ; c[2i-2] = m[2i-2]*b[2i-2] - m[2i-1]*b[2i- 
non-DualMac op 
MULTA   m2, r1, r3        ; m[2i] * b[2i] 
MFA     rreal, m3h 
MULTA   m3, r2, r4        ; m[2i+1] * b[2i+1] 
ADDMA   m1h, m0h, m1h     ; c[2i+1] = a[2i+1]*b[2i] + a[2i]*b[2i+1] 



9. Low-overhead interrupts 

The DSP consistent with this invention accommodates eight low-overhead hardware 
interrupt signals that are useful for real-time applications. These Interrupts are supported with 
three coprocessor 0 registers: ESTATUS (0) 1610, ECAUSE (0) 1620, and INTVEC (0) 1630, 
which are illustrated in Fig. 16. 

ESTATUS register 1610 contains the mask bits IM[15-8] for the low-overhead interrupt 
signals. IM[15-8] is reset to 0 so that, regardless of the global interrupt enable, IEc, none of the 
interrupts will be activated. IP[15-8] for the interrupt signals is located in ECAUSE 1620. Each 
of these fields is similar to IM and IP defined in the R3000 Exception Processing model. One 
difference is that the interrupts are prioritized in hardware and each has a dedicated Exception 
Vector. 

IP[15] has the highest interrupt priority, IP[14] next, etc. All new interrupts are higher 
priority than IP[0-7]. The Exception vector for the interrupts is located in INTVEC 
register 1630. The BASE for these vectors is program-defined. The vector for IP[15] is BASE || 
111000 ||, for IP[14] BASE || 110000 ||, etc. Thirty-two instructions can be executed at each 
vector; if more are required the program can jump to any address. 
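
The hardware prioritization described above can be sketched as a selection of the highest-numbered pending, unmasked interrupt. The sketch covers only IP[15-8] and IM[15-8] and does not model vector address formation; the function name and argument layout are hypothetical.

```python
def highest_pending_interrupt(ip, im, iec):
    """Return the number (15..8) of the interrupt to service, or None."""
    active = ip & im & 0xFF00          # pending and unmasked, bits 15:8 only
    if not iec or active == 0:         # global enable off, or nothing to service
        return None
    return active.bit_length() - 1     # highest set bit has the highest priority
```

If IP[15] and IP[13] are both pending but IM[15] is zero, for example, interrupt 13 is selected.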

10. Operational Codes 

DSP architecture consistent with this invention allows for an extension of a standard 
instruction set by designating a single I-Format as "LEXOP," then using the INST[5:0] "subop" 
field to permit up to 64 additional opcodes. Thus, the additional DSP opcodes model the MIPS 
"special" opcodes encoded in R-Format. The diagrams in Figs. 17A-H illustrate an example of 
the type of opcode extensions the present invention can designate. In this example, a "LEXOP" 
designation code with an I-Format 011_111 is used. Additional DSP opcodes can be integrated 
into an established MIPS instruction set by using a relocatable code to designate 64 subops. 

In the present embodiment, the default object code for LEXOP is 011_111. However, the 
location can be moved using external power/ground straps, for example. These straps or similar 
configurable devices help ensure long-term compatibility of the DSP extensions with future MIPS ISA 
extensions. 
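
The two-level decode described above can be sketched as follows. The sketch assumes the major opcode occupies INST[31:26], as in the standard MIPS instruction formats, and models the strap-selectable LEXOP code as a function argument; the function name and return format are hypothetical.

```python
DEFAULT_LEXOP = 0b011111  # default LEXOP code; relocatable in the embodiment


def decode(inst, lexop=DEFAULT_LEXOP):
    """Classify a 32-bit instruction word as a LEXOP subop or a standard opcode."""
    major = (inst >> 26) & 0x3F        # INST[31:26]
    if major == lexop:
        return ("LEXOP", inst & 0x3F)  # INST[5:0] selects one of 64 subops
    return ("MIPS", major)
```

Passing a different `lexop` value models moving the designation code to another opcode slot.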

The following principles are used to resolve potential ambiguity of encoding between the 
DSP extensions of the present invention and MIPS instructions: 

a. Instructions with similar operations to an existing MIPS instruction set, but with 
additional operands permitted, are programmed with additional Assembler mnemonics 
and encoded as a LEXOP. For instance: 

multa m1, r1, r2    is encoded as a LEXOP instruction 
mult r1, r2         is encoded as a MIPS instruction 
multa m0, r1, r2    is encoded as a LEXOP instruction (a0 is an alias for HI/LO). 

b. If a MIPS instruction is "extended" with additional functionality, it is programmed with 
additional Assembler mnemonics and encoded as a LEXOP. In the disclosed example, 
mnemonics ending in "r" indicate general register file targets; mnemonics ending in "m" 
indicate accumulator register targets. This convention removes ambiguity between the 
Lexra op and a similar MIPS op. For example, 

addr r3, r1, r2    is encoded as a LEXOP instruction 
add r3, r1, r2     is encoded as a MIPS instruction 

The MIPS add and the LEXOP addr are both signed 32-bit additions. However, on 
overflow the MIPS instruction triggers the Overflow Exception, while the LEXOP does not. 
Alternatively, the result of the LEXOP will saturate if the ".s" option is selected (addr.s). 

The foregoing description is presented for purposes of illustration and description. It is 
not exhaustive and does not limit the invention to the precise form disclosed. Modifications and 
variations are possible in light of the above teachings or may be acquired from practicing the 
invention. The scope of the invention is defined by the claims and their equivalents. 






WHAT IS CLAIMED IS: 

1. A digital signal processor comprising: 

two execution pipelines capable of executing RISC instructions; 
instruction fetch logic that simultaneously fetches two instructions and routes them to 
respective pipelines; and 

control logic to allow the pipelines to operate independently. 

2. The digital signal processor of claim 1, wherein the instruction fetch logic includes 
logic that fetches dual SIMD instructions. 

3. The digital signal processor of claim 1, further including two registers each half the 
length of a word fetched for memory, and 

wherein the instruction fetch logic that fetches a single word into the two registers 
simultaneously. 

4. The digital signal processor of claim 1, further including an eight port 
general register file. 

5. The digital signal processor of claim 4, wherein the general register file includes 
four read registers and 

four write registers. 

6. A digital signal processor capable of integrating subopcodes into an established 
instruction set comprising: 

a memory that stores instructions having opcodes; 

an instruction decoder that identifies a relocatable opcode to designate 64 subopcodes; 

and 

a subopcode detector that decodes subopcodes if the instruction decoder identifies the 
relocatable opcode. 

7. A digital signal processor comprising: 
a register pair; and 

means for executing a multiply instruction on a number stored in the register pair, 
including 

first means for performing multiply instructions on higher-order portions of each 
register in the register pair, 

second means for performing multiply instructions on the remaining portions of 
each register in the register pair, and 

third means for combining the results from the first and second means. 

8. A circular buffer control circuit comprising: 
a first number of circular buffer start registers; 

a first number of circular buffer end registers, each associated with a different one of the 
circular buffer start registers; and 

circular buffer control logic including 

means for comparing a pointer to an address in a selected one of the circular 
buffer end registers, and 

means for restoring the address in the one of the circular buffer start registers 
associated with the selected circular buffer end register if the pointer matches the address 
in the selected circular buffer end register. 

9. A digital signal processor capable of executing zero overhead looping instruction 
commands comprising: 
a register set; and 

means for executing a loop instruction command a fixed number of times on a number 
stored in the register set, including 

first means for executing a current instruction stored in a first portion of a first 
register within the register set; 

second means for decrementing a loop count value stored in a second register 
within the register set; 

third means for executing another portion of the current instruction stored in a 
second portion of the first register and a second register within the register set. 



10. The digital signal processor of claim 9, further including 

means for exiting the loop instruction command when the loop count value 
reaches zero. 

ABSTRACT 

A DSP superscalar architecture employing dual multiply accumulate pipelines. Dual 
MAC pipelines allow for a seamless transition between established RISC instruction sets and 
extended DSP instruction sets. Relocatable opcodes are provided to allow further extensions of 
RISC instruction sets. The DSP superscalar architecture also provides memory pointers with 
hardware circular buffer support, an interruptible and nested zero-overhead loop counter, and 
prioritized low-overhead interrupts. 

MAC Radiax Instruction Summary 



Instruction 


Syntax and Description 


Dual Move to 
Accumulator 


MTA2[.G] rS, {mD, mDh, mDl} 

If MTA2, and mDh(mDl) is selected, sign-extend the contents of general regis- 
ter rS to 40-bits and move to accumulator register mDh(mDl). If MTA2, and 
mD is selected, update both mDh and mDl with the 40-bit, sign-extended con- 
tents of the same rS. If MTA2.G is selected, the accumulator register bits 
[39:32] are updated with rS[31:24]; bits [31:00] of the accumulator are 
unchanged. (The .G option is used to restore the upper-bits of the accumulator 
from the general register file; typically, following an Exception.) 


Move From 
Accumulator 


MFA rD, {mTh, mTl} [,n] 

Move the contents of accumulator register mTh or accumulator register mTl to 
register rD with optional right shift. Bits [31+n : n] from the accumulator register 
are transferred to rD[31:00]. The range n = 0 - 8 is permitted for the output 
alignment shift amount. In the case of n = 0, the field may be omitted. 


Dual Move From 
Accumulator 


MFA2 rD,mT[,n] 

Move the contents of the upper halves of accumulator register pair mT to register 
rD with optional right shift. The rD[31:16] are taken from mTh and 
rD[15:00] from the corresponding mTl. mTh[31+n : 16+n] || mTl[31+n : 16+n] 
from the accumulator register pair are transferred to rD[31:00]. The range n = 0 
- 8 is permitted for the output alignment shift amount. In the case of n = 0, the 
field may be omitted. 


Divide 


DIVA mD, rS, rT 

The contents of register rS is divided by rT, treating the operands as signed 2's 
complement values. The remainder is sign-extended to 40-bits and stored in 
mDh and the quotient is sign-extended to 40-bits and stored in mDl. 
m0h[31:00] is also called HI. m0l[31:00] is also called LO. 


Divide Unsigned 


DIVAU mD, rS, rT 

The contents of register rS is divided by rT, treating the operands as unsigned 
values. The remainder is zero-extended to 40-bits and stored in mDh and the 
quotient is zero-extended to 40-bits and stored in mDl. m0h[31:00] is also 
called HI. m0l[31:00] is also called LO. 


Multiply 
(32-bit) 


MULTA mD, rS, rT 

The contents of register rS is multiplied by rT, treating the operands as signed 
2's complement values. The upper 32-bits of the 64-bit product is sign-extended 
to 40-bits and stored in mDh and the lower 32-bits is zero-extended to 
40-bits and stored in the corresponding mDl. m0h[31:00] is also called HI. 
m0l[31:00] is also called LO. If MMD[MT] is 1, then the partial product 
rS[15:00] x rT[15:00] is not included in the total product. If MMD[MF] is 1, 
then the product is left shifted by one bit, and furthermore, if both operands are 
-1 then the product is set to positive signed, all ones fraction, prior to the shift. 
If both MMD[MT] and MMD[MF] are 1, the result is undefined. 





Multiply 

Unsigned 

(32-bit) 


MULTAU mD, rS, rT 

The contents of register rS is multiplied by rT, treating the operands as 
unsigned values. The upper 32-bits of the 64-bit product is zero-extended to 40-bits 
and stored in mDh and the lower 32-bits is zero-extended to 40-bits and 
stored in the corresponding mDl. m0h[31:00] is also called HI. m0l[31:00] is 
also called LO. If MMD[MT] is 1, then the partial product rS[15:00] x 
rT[15:00] is not included in the total product. If MMD[MF] is 1, then the result 
is undefined. 


Dual Multiply 
(16-bit) 


MULTA2 {mD, mDh, mDl}, rS, rT 

The contents of register rS is multiplied by rT, treating the operands as signed 
2's complement values. If the destination register is mDh, rS[31:16] is multiplied 
by rT[31:16] and the product is sign-extended to 40-bits and stored in 
mDh. If the destination register is mDl, rS[15:00] is multiplied by rT[15:00] 
and the product is sign-extended to 40-bits and stored in mDl. If the destination 
is mD, both operations are performed and the two products are stored in the 
accumulator register pair mD. If MMD[MF] is 1, then each product is left 
shifted by one bit, and furthermore, for each multiply, if both operands are -1 
then the product is set to positive signed, all ones fraction. 
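
The fractional-mode behavior of a single 16x16 multiply can be sketched as follows. The sketch assumes that "-1" refers to the fractional value -1.0 (0x8000 as a 16-bit operand) and that the "positive signed, all ones fraction" is 0x7FFFFFFF; both are interpretations rather than statements from the table, and the function name is hypothetical.

```python
def multiply16(a, b, fractional=False):
    """Multiply two signed 16-bit values; fractional models MMD[MF] = 1."""
    product = a * b
    if fractional:
        if a == -0x8000 and b == -0x8000:
            return 0x7FFFFFFF      # (-1.0) x (-1.0) cannot be represented; clamp
        product <<= 1              # align the fractional product
    return product
```

In fractional mode, 0x4000 x 0x4000 (0.5 x 0.5) yields 0x20000000, which is 0.25 as a 32-bit fraction.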


Dual Multiply and 

Negate 

(16-bit) 


MULNA2 {mD, mDh, mDl}, rS, rT 

The contents of register rS is multiplied by rT, treating the operands as signed 
2's complement values. If the destination register is mDh, rS[31:16] is multiplied 
by rT[31:16] and the product is sign-extended to 40-bits, negated (i.e. 
subtracted from zero) and stored in mDh. If the destination register is mDl, 
rS[15:00] is multiplied by rT[15:00] and the product is sign-extended to 40-bits, 
negated (i.e. subtracted from zero) and stored in mDl. If the destination is 
mD, both operations are performed and the two products are stored in the accumulator 
register pair mD. If MMD[MF] is 1, then each product is left shifted by 
one bit prior to sign-extension and negation, and furthermore, for each multiply, 
if both operands are -1 then the product is set to positive signed, all ones 
fraction prior to sign-extension and negation. 


Complex 
Multiply 


CMULTA mD, rS, rT 

rS[31:16] is interpreted as the real part of a complex number. rS[15:00] is interpreted 
as the imaginary part of the same complex number. Similarly for the 
contents of general register rT. As the result of CMULTA, mDh is updated with 
the real part of the product, sign-extended to 40-bits and mDl is updated with 
the imaginary part of the product, sign-extended to 40-bits. If MMD[MF] is 1, 
then each product is left shifted by one bit, and furthermore, for each multiply, 
if both operands are -1 then the product is set to positive signed, all ones fraction, 
prior to the addition of terms. 





32-bit Multiply- Add 
with 72-bit accumu- 
late 


MADDA mD, rS, rT 

The contents of register rS is multiplied by rT treating the operands as signed
2's complement values. If MMD[MT] is 1, then the partial product rS[15:00] x
rT[15:00] is not included in the total product. If MMD[MF] is 1, then the product
is left shifted by one bit, and furthermore, if both operands are -1 then the
product is set to a positive signed, all ones fraction. If both MMD[MT] and
MMD[MF] are 1, then the result of the multiply is undefined.
The 64-bit product is sign-extended to 72-bits and added to the concatenation
mDh[39:0] || mDl[31:0], ignoring mDl[39:32]. The lower 32 bits of the result
are zero-extended to 40-bits and stored into mDl. The upper 40-bits of the
result are stored into mDh.
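
The 72-bit accumulation can be sketched as follows. This is an illustrative model only; it assumes MMD[MT] = MMD[MF] = 0, and the function name is hypothetical.

```python
MASK32 = (1 << 32) - 1
MASK40 = (1 << 40) - 1
MASK72 = (1 << 72) - 1

def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def madda(mdh, mdl, rs, rt):
    """Model MADDA mD, rS, rT; mdh and mdl are 40-bit patterns. Returns (mDh, mDl)."""
    product = to_signed(rs, 32) * to_signed(rt, 32)      # signed 64-bit product
    acc = ((mdh & MASK40) << 32) | (mdl & MASK32)         # mDh[39:0] || mDl[31:0]
    total = (acc + product) & MASK72                      # 72-bit wrap-around add
    return (total >> 32) & MASK40, total & MASK32         # mDl is zero-extended
```

Because Python integers are unbounded, adding the signed product and masking to 72 bits gives the same result as sign-extending the product to 72 bits first.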


32-bit unsigned Mul- 
tiply- Add with 72-bit 
accumulate 


MADDAU mD, rS, rT

The contents of register rS is multiplied by rT treating the operands as unsigned
values. If MMD[MT] is 1, then the partial product rS[15:00] x rT[15:00] is not
included in the total product. If MMD[MF] is 1, then the result of the multiply
is undefined.

The 64-bit product is zero-extended to 72-bits and added to the concatenation
mDh[39:0] || mDl[31:0], ignoring mDl[39:32]. The lower 32 bits of the result
are zero-extended to 40-bits and stored into mDl. The upper 40-bits of the
result are stored into mDh.


Dual Multiply- Add, 

optional 

saturation 


MADDA2[.S] {mD, mDh, mDl}, rS, rT 

The contents of register rS is multiplied by rT and added to an accumulator reg- 
ister, treating the operands as signed 2's complement values. If the destination 
register is mDh, rS[31:16] is multiplied by rT[31:16] then sign-extended and 
added to mDh[39:00]. If the destination register is mDl, rS[15:00] is multiplied 
by rT[15:00] then sign-extended and added to mDl[39:00]. If the destination 
is mD, both operations are performed and the two results are stored in the accu- 
mulator register pair mD. If MADDA2.S the result of each addition is saturated 
before storage in the accumulator register. The multiplies are subject to 
MMD[MF] as in MULTA2. The saturation point is selected as either 40 or 32 
bits by MMD[MS]. 
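
The saturating accumulate of MADDA2.S can be illustrated for a single accumulator half. This is a sketch under stated assumptions: the source does not say which value of MMD[MS] selects which saturation point, so the width is passed as a hypothetical `sat_bits` parameter, and MMD[MF] is taken as 0.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def saturate(value, bits):
    """Clamp value to the signed range representable in `bits` bits."""
    high = (1 << (bits - 1)) - 1
    low = -(1 << (bits - 1))
    return max(low, min(high, value))

def madda2_s(acc, a, b, sat_bits=40):
    """acc is the signed accumulator value; a and b are 16-bit operand patterns."""
    total = acc + to_signed(a, 16) * to_signed(b, 16)
    return saturate(total, sat_bits)
```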


32-bit Multiply-Sub- 
tract with 72-bit 
accumulate 


MSUBA mD, rS, rT 

The contents of register rS is multiplied by rT treating the operands as signed
2's complement values. If MMD[MT] is 1, then the partial product rS[15:00] x
rT[15:00] is not included in the total product. If MMD[MF] is 1, then the product
is left shifted by one bit, and furthermore, if both operands are -1 then the
product is set to a positive signed, all ones fraction. If both MMD[MT] and
MMD[MF] are 1, then the result of the multiply is undefined.
The 64-bit product is sign-extended to 72-bits and subtracted from the concatenation
mDh[39:0] || mDl[31:0], ignoring mDl[39:32]. The lower 32 bits of the
result are zero-extended to 40-bits and stored into mDl. The upper 40-bits of
the result are stored into mDh.
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Instruction 


Syntax and Description 


32-bit unsigned Multiply-Subtract
with 72-bit accumulate


MSUBAU mD, rS, rT

The contents of register rS is multiplied by rT treating the operands as unsigned 
values. If MMD[MT] is 1, then the partial product rS[15:00] x rT[15:00] is not 
included in the total product. If MMD[MF] is 1, then the result of the multiply 
is undefined. 

The 64-bit product is zero-extended to 72-bits and subtracted from the concatenation
mDh[39:0] || mDl[31:0], ignoring mDl[39:32]. The lower 32 bits of the
result are zero-extended to 40-bits and stored into mDl. The upper 40-bits of
the result are stored into mDh.


Dual Multiply-Sub, 

optional 

saturation 


MSUBA2[.S] {mD, mDh, mDl}, rS, rT

The contents of register rS is multiplied by rT and subtracted from an accumulator
register, treating the operands as signed 2's complement values. If the destination
register is mDh, rS[31:16] is multiplied by rT[31:16] then sign-extended
and subtracted from mDh[39:00]. If the destination register is mDl,
rS[15:00] is multiplied by rT[15:00] then sign-extended and subtracted from
mDl[39:00]. If the destination is mD, both operations are performed and both
results are stored in the accumulator register pair mD. If MSUBA2.S the result
of each subtraction is saturated before storage in the accumulator register.


Add 

Accumulators 


ADDMA[.S] mD{h,l}, mS{h,l}, mT{h,l}

The contents of accumulator mTh or mTl is added to the contents of accumula- 
tor mSh or mSl, treating both registers as signed 40-bit values. mDh or mDl is 
updated with the result. If ADDMA.S, the result is saturated before storage. 
The saturation point is selected as either 40 or 32 bits by MMD[MS]. 


Subtract 
Accumulators 


SUBMA[.S] mD{h,l}, mS{h,l}, mT{h,l}

The contents of accumulator mTh or mTl is subtracted from the contents of
accumulator mSh or mSl, treating both registers as signed 40-bit values. mDh
or mDl is updated with the result. If SUBMA.S, the result is saturated before
storage. The saturation point is selected as either 40 or 32 bits by MMD[MS].


Dual Round 


RNDA2 {mT, mTh, mTl} [,n]

The accumulator register mTh or mTl is rounded, then updated. If mT, the 
accumulator register pair mTh/mTl are each rounded, then updated. The round- 
ing mode is selected in MMD field "RND". The least significant bit of preci- 
sion in the accumulator register after rounding is: 16+n. Bits [15+n : 00] are 
zeroed. The range n = 0 - 8 is permitted for the output alignment shift amount. 
In the case of n = 0, the field may be omitted. 
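
The effect of the output alignment shift n on the rounding position can be sketched as below. The rounding modes selectable through MMD are not described in this section, so the sketch assumes simple round-half-up (add half of the target LSB, then clear the low bits) with 40-bit wrap-around; both choices are illustrative assumptions, and the function name is hypothetical.

```python
MASK40 = (1 << 40) - 1

def rnda(acc, n=0):
    """Round a 40-bit accumulator pattern so that bit 16+n is the least significant bit."""
    if not 0 <= n <= 8:
        raise ValueError("n must be in the range 0 - 8")
    lsb = 16 + n
    rounded = (acc + (1 << (lsb - 1))) & MASK40   # assumed mode: round half up
    return rounded & ~((1 << lsb) - 1)            # zero bits [15+n:00]
```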



Nomenclature:

rS, rT = r0 - r31
mD = mDh || mDl; also for mT
mDh = m0h - m3h; also for mSh, mTh
mDl = m0l - m3l; also for mSl, mTl
HI = m0h[31:00]
LO = m0l[31:00]



Assignment of Instructions to Pipe A, Pipe B

Pipe A: The Load/Store Pipe
Pipe B: The MAC Pipe

MIPS 32-bit General Instructions
  Pipe A: MIPS 32-bit General Instructions except:
          CE1 Custom Engine Opcodes,
          MULT(U), DIV(U), MFHI, MFLO,
          MTHI, MTLO, MAD(U), MSUB(U)
  Pipe B: MULT(U), DIV(U), MFHI, MFLO,
          MTHI, MTLO, MAD(U), MSUB(U),
          CE1 Custom Engine Opcodes,
          MIPS 32-bit ALU Instructions
          Note: No Load or Store Instructions

MIPS 32-bit Control Instructions
  Pipe A: J, JAL, JR, JALR, JALX,
          SYSCALL, BREAK,
          All Branch Instructions,
          All COPz, SWCz, LWCz
  Pipe B: (none)

MIPS 16 Instructions (No Doubleword Instructions)
  Pipe A: All MIPS 16 Instructions except:
          MULT(U), DIV(U), MFHI, MFLO
  Pipe B: MULT(U), DIV(U), MFHI, MFLO

EJTAG Instructions
  Pipe A: DERET, SDBBP
          (including MIPS 16 SDBBP)
  Pipe B: (none)

Lexra Control Instructions
  Pipe A: MTRU, MFRU, MTRK, MFRK,
          MTLXC0, MFLXC0
  Pipe B: (none)

Lexra Vector Addressing
  Pipe A: LT, ST,
          LTP, LWP,
          LHP(U), LBP(U),
          STP, SWP,
          SHP, SBP
  Pipe B: (none)

Lexra MAC Instructions
  Pipe A: (none)
  Pipe B: MTA2, MFA, MFA2, MULTA,
          MULTA2, MULNA2, CMULTA,
          MADDA, MSUBA, ADDMA,
          SUBMA, DIVA, RNDA2

Lexra Extensions to MIPS ALU Instructions
  Pipe A: SLLV2, SRLV2, SRAV2,
          ADDR, ADDR2, SUBR, SUBR2,
          SLTR2
  Pipe B: SLLV2, SRLV2, SRAV2,
          ADDR, ADDR2, SUBR, SUBR2,
          SLTR2

New Lexra ALU Operations
  Pipe A: MIN, MIN2, MAX, MAX2, ABSR,
          ABS2, CLS, MUX2, BITREV,
          CMVEQZ, CMVNEZ
  Pipe B: MIN, MIN2, MAX, MAX2, ABSR,
          ABS2, CLS, MUX2, BITREV,
          CMVEQZ, CMVNEZ



Vector Addressing Instruction Summary 



Instruction 


Syntax and Description


Load Twinword 


LT rT, displacement(base) 

The displacement, in bytes, is a signed 14-bit quantity that must be divisible by
8 (since it occupies only 11 bits of the instruction word). Sign-extend the
displacement to 32-bits and add to the contents of register base to form the address
temp. Load contents of word addressed by temp into register rT (which must be
an even register). Load contents of word addressed by temp+4 into register
rT+1.


Store Twinword 


ST rT, displacement(base)

The displacement, in bytes, is a signed 14-bit quantity that must be divisible by
8 (since it occupies only 11 bits of the instruction word). Sign-extend the
displacement to 32-bits and add to the contents of register base to form the address
temp. Store contents of register rT (which must be an even register) into word
addressed by temp. Store contents of register rT+1 into word addressed by
temp+4.
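
The relationship between the 14-bit byte displacement and the 11-bit instruction field used by LT and ST can be sketched as follows. The helper names are hypothetical; the sketch only illustrates that the three low-order zero bits are dropped on encoding and restored on decoding.

```python
def encode_displacement(disp):
    """Pack a signed, 8-byte-aligned byte displacement into the 11-bit field."""
    if disp % 8 != 0:
        raise ValueError("displacement must be divisible by 8")
    if not -(1 << 13) <= disp < (1 << 13):
        raise ValueError("displacement must fit in a signed 14-bit quantity")
    return (disp >> 3) & 0x7FF          # store displacement/8 in 11 bits

def decode_displacement(field):
    """Recover the signed byte displacement from the 11-bit field."""
    value = field & 0x7FF
    if value & 0x400:                   # sign bit of the 11-bit field
        value -= 0x800
    return value * 8
```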


Load Twinword, 
Pointer Increment, 
optional circular 
buffer 


LTP[.Cn] rT, (pointer)stride 

Let temp = contents of register pointer. Load contents of word addressed by
temp into register rT (which must be an even register). Load contents of word
addressed by temp+4 into register rT+1. The stride, in bytes, is a signed 11-bit
quantity that must be divisible by 8 (since it occupies only 8 bits of the
instruction word). Sign-extend the stride to 32-bits and add to contents of register
pointer to form next address. Update pointer with the calculated next address.
".Cn" selects circular buffer n = 0 - 2. See Note 2.


Load Word, 
Pointer Increment, 
optional circular 
buffer 


LWP[.Cn] rT, (pointer)stride

Load contents of word addressed by register pointer into register rT. The stride,
in bytes, is a signed 10-bit quantity that must be divisible by 4 (since it
occupies only 8 bits of the instruction word). Sign-extend the stride to 32-bits and
add to contents of register pointer to form next address. Update pointer with the
calculated next address. ".Cn" selects circular buffer n = 0 - 2. See Note 2.


Load Halfword, 
Pointer Increment, 
optional circular 
buffer 


LHP[.Cn] rT, (pointer)stride

Load contents of sign-extended halfword addressed by register pointer into reg- 
ister rT. The stride, in bytes, is a signed 9-bit quantity that must be divisible by 
2 (since it occupies only 8 bits of the instruction word). Sign-extend the stride 
to 32-bits and add to contents of register pointer to form next address. Update 
pointer with the calculated next address. ".Cn" selects circular buffer n = 0 - 2. 
See Note 2. 


Load Halfword 
Unsigned, 
Pointer Increment, 
optional circular 
buffer 


LHPU[.Cn] rT, (pointer)stride

Load contents of zero-extended halfword addressed by register pointer into 
register rT. The stride, in bytes, is a signed 9-bit quantity that must be divisible 
by 2 (since it occupies only 8 bits of the instruction word). Sign-extend the 
stride to 32-bits and add to contents of register pointer to form next address. 
Update pointer with the calculated next address. ".Cn" selects circular buffer n 
= 0-2. See Note 2. 





Load Byte, 
Pointer Increment, 
optional circular 
buffer 


LBP[.Cn] rT, (pointer)stride

Load contents of sign-extended byte addressed by register pointer into register 
rT. The stride, in bytes, is a signed 8-bit quantity. Sign-extend the stride to 32- 
bits and add to contents of register pointer to form next address. Update pointer 
with the calculated next address. ".Cn" selects circular buffer n = 0 - 2. See 
Note 2. 


Load Byte 
Unsigned, 
Pointer Increment, 
optional circular 
buffer 


LBPU[.Cn] rT, (pointer)stride

Load contents of zero-extended byte addressed by register pointer into register 
rT. The stride, in bytes, is a signed 8-bit quantity. Sign-extend the stride to 32- 
bits and add to contents of register pointer to form next address. Update pointer 
with the calculated next address. ".Cn" selects circular buffer n = 0 - 2. See 
Note 2. 


Store Twinword, 
Pointer Increment, 
optional circular 
buffer 


STP[.Cn] rT, (pointer)stride

Let temp = contents of register pointer. Store contents of register rT (which
must be an even register) into word addressed by temp. Store contents of register
rT+1 into word addressed by temp+4. The stride, in bytes, is a signed 11-bit
quantity that must be divisible by 8 (since it occupies only 8 bits of the
instruction word). Sign-extend the stride to 32-bits and add to contents of register
pointer to form next address. Update pointer with the calculated next address.
".Cn" selects circular buffer n = 0 - 2. See Note 2.


Store Word, 
Pointer Increment, 
optional circular 
buffer 


SWP[.Cn] rT, (pointer)stride

Store contents of register rT into word addressed by register pointer. The stride, 
in bytes, is a signed 10-bit quantity that must be divisible by 4 (since it occu- 
pies only 8 bits of the instruction word). Sign-extend the stride to 32-bits and 
add to contents of register pointer to form next address. Update pointer with 
the calculated next address. ".Cn" selects circular buffer n = 0 - 2. See Note 2. 


Store Halfword,
Pointer Increment, 
optional circular 
buffer 


SHP[.Cn] rT, (pointer)stride

Store contents of register rT[15:00] into 16-bit halfword addressed by register
pointer. The stride, in bytes, is a signed 9-bit quantity that must be divisible by
2 (since it occupies only 8 bits of the instruction word). Sign-extend the stride 
to 32-bits and add to contents of register pointer to form next address. Update 
pointer with the calculated next address. ".Cn" selects circular buffer n = 0 - 2. 
See Note 2. 


Store Byte, 
Pointer Increment, 
optional circular 
buffer 


SBP[.Cn] rT, (pointer)stride

Store contents of register rT[07:00] into byte addressed by register pointer. The
stride, in bytes, is a signed 8-bit quantity. Sign-extend the stride to 32-bits and
add to contents of register pointer to form next address. Update pointer with
the calculated next address. ".Cn" selects circular buffer n = 0 - 2. See Note 2.


Move To Radiax, 
User 


MTRU rT, RADREG 

Move the contents of register rT to one of the User Radiax registers: cbs0 -
cbs2, cbe0 - cbe2, mmd, lpc0, lpe0, lps0. This instruction has a single delay slot
before the updated register takes effect.





Move From Radiax,
User


MFRU rT, RADREG

Move the contents of the designated User Radiax register (cbs0 - cbs2, cbe0 -
cbe2, mmd, lpc0, lps0, lpe0) to register rT.



Nomenclature: 



rT = r0 - r31, and must be even for LT, ST, LTP[.Cn], STP[.Cn]

base, pointer = r0 - r31

stride = 8/9/10/11-bit signed value (in bytes) for byte/halfword/word/twinword ops.

displacement = 14-bit signed value, in bytes

RADREG = cbs0 - cbs2, cbe0 - cbe2, mmd, lpc0, lps0, lpe0



Notes: 

1. For LTP[.Cn], LWP[.Cn], LHP(U)[.Cn], LBP(U)[.Cn], rT = pointer is unsupported. 

2. When a circular buffer is selected, the update of the pointer register is performed according to the following 
algorithm, which depends on the sign of the stride and the granularity of the access. A stride exactly equal to 0 is not
supported: 



For LBP(U).Cn and SBP.Cn:

if (stride > 0 && pointer[2:0] == 111 && pointer[31:3] == CBEn)
    then pointer <= CBSn[31:3] || 000
else if (stride < 0 && pointer[2:0] == 000 && pointer[31:3] == CBSn)
    then pointer <= CBEn[31:3] || 111
else pointer <= pointer + stride.

For LHP(U).Cn and SHP.Cn:

if (stride > 0 && pointer[2:0] == 11x && pointer[31:3] == CBEn)
    then pointer <= CBSn[31:3] || 000
else if (stride < 0 && pointer[2:0] == 00x && pointer[31:3] == CBSn)
    then pointer <= CBEn[31:3] || 110
else pointer <= pointer + stride.

For LWP.Cn and SWP.Cn:

if (stride > 0 && pointer[2:0] == 1xx && pointer[31:3] == CBEn)
    then pointer <= CBSn[31:3] || 000
else if (stride < 0 && pointer[2:0] == 0xx && pointer[31:3] == CBSn)
    then pointer <= CBEn[31:3] || 100
else pointer <= pointer + stride.

For LTP.Cn and STP.Cn:

if (stride > 0 && pointer[31:3] == CBEn)
    then pointer <= CBSn[31:3] || 000
else if (stride < 0 && pointer[31:3] == CBSn)
    then pointer <= CBEn[31:3] || 000
else pointer <= pointer + stride.
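
The four cases of the circular-buffer pointer update differ only in which of the low three pointer bits are examined, so they can be folded into one routine parameterised by access size. The sketch below is illustrative: the function name and the `size` parameter are hypothetical, and it assumes CBSn and CBEn are 32-bit register values of which only bits [31:3] are used.

```python
def next_pointer(pointer, stride, cbs, cbe, size):
    """Return the updated pointer for an access of `size` bytes (1, 2, 4 or 8)."""
    if stride == 0:
        raise ValueError("a stride of 0 is not supported")
    compared = ~(size - 1) & 0x7        # 111, 110, 100 or 000: bits that are tested
    low = pointer & compared
    top = pointer >> 3
    if stride > 0 and low == compared and top == cbe >> 3:
        return (cbs >> 3) << 3                      # wrap to CBSn[31:3] || 000
    if stride < 0 and low == 0 and top == cbs >> 3:
        return ((cbe >> 3) << 3) | compared         # wrap to CBEn[31:3] || 111/110/100/000
    return (pointer + stride) & 0xFFFFFFFF
```

With a buffer whose first twinword is at 0x1000 and last twinword is at 0x1008, a word access at 0x100C with a positive stride wraps to 0x1000, and a word access at 0x1000 with a negative stride wraps to 0x100C.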



Extensions to MIPS ALU Operations 



Instruction 


Syntax and Description


Dual Shift Left 
Logical Variable 


SLLV2 rD, rT, rS

The contents of rT[31:16] and the contents of rT[15:00] are independently
shifted left by the number of bits specified by the low order four bits of the
contents of general register rS, inserting zeros into the low order bits of rT[31:16]
and rT[15:00]. For SLLV2, the high and low results are concatenated and
placed in register rD. (Note that a [.S] option is not provided because this is a
logical rather than arithmetic shift and thus the concept of arithmetic overflow
is not relevant.)


Dual Shift Right 
Logical Variable 


SRLV2 rD, rT, rS 

The contents of rT[31:16] and the contents of rT[15:00] are independently 
shifted right by the number of bits specified by the low order four bits of the 
contents of general register rS, inserting zeros into the high order bits of 
rT[31:16] and rT[15:00]. The high and low results are concatenated and placed 
in register rD. (Note that a [.S] option is not provided because this is a logical 
rather than arithmetic shift and thus the concept of arithmetic overflow is not 
relevant.)


Dual Shift Right 
Arithmetic Variable 


SRAV2 rD, rT, rS

The contents of rT[31:16] and the contents of rT[15:00] are independently
shifted right by the number of bits specified by the low order four bits of the
contents of general register rS, sign-extending the high order bits of rT[31:16]
and rT[15:00]. The high and low results are concatenated and placed in register
rD. (Note that a [.S] option is not provided because arithmetic overflow/underflow
is not possible.)
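
The three dual shifts can be modelled as independent 16-bit shifts on each half of rT. This is a minimal sketch with hypothetical function names; only the low four bits of rS are used as the shift amount, as described above.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def sllv2(rt, rs):
    n = rs & 0xF
    hi = ((rt >> 16) << n) & 0xFFFF
    lo = (rt << n) & 0xFFFF
    return (hi << 16) | lo

def srlv2(rt, rs):
    n = rs & 0xF
    hi = ((rt >> 16) & 0xFFFF) >> n
    lo = (rt & 0xFFFF) >> n
    return (hi << 16) | lo

def srav2(rt, rs):
    n = rs & 0xF
    hi = (to_signed(rt >> 16, 16) >> n) & 0xFFFF   # Python >> on a negative int is arithmetic
    lo = (to_signed(rt, 16) >> n) & 0xFFFF
    return (hi << 16) | lo
```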


Add, 

optional saturation 


ADDR[.S] rD, rS, rT

32-bit addition. Considering both quantities as signed 32-bit integers, add the
contents of register rS to rT. For ADDR, the result is placed in register rD,
ignoring any overflow or underflow. For ADDR.S, the result is saturated to
0 || 1^31 (if overflow) or 1 || 0^31 (if underflow) then placed in rD. ADDR[.S] will not
cause an Overflow Trap.


Dual Add, 
optional saturation


ADDR2[.S] rD, rS, rT 

Dual 16-bit addition. Considering all quantities as signed 16-bit integers, add
the contents of register rS[15:00] to rT[15:00] and, independently add the
contents of register rS[31:16] to rT[31:16]. For ADDR2, the high and low results
are concatenated and placed in register rD ignoring any overflow or underflow.
For ADDR2.S, the two results are independently saturated to 0 || 1^15 (if
overflow) or 1 || 0^15 (if underflow) then placed in rD. ADDR2[.S] will not cause an
Overflow Trap.
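
The difference between ADDR2 and ADDR2.S shows up only when a 16-bit lane overflows. The following sketch (hypothetical function name) models both forms:

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def addr2(rs, rt, saturate=False):
    """Dual 16-bit add; each lane wraps (ADDR2) or saturates (ADDR2.S) independently."""
    result = 0
    for shift in (16, 0):
        total = to_signed(rs >> shift, 16) + to_signed(rt >> shift, 16)
        if saturate:
            total = max(-0x8000, min(0x7FFF, total))   # clamp to 1||0^15 .. 0||1^15
        result |= (total & 0xFFFF) << shift
    return result
```

Adding 1 to 0x7FFF in the high lane and -1 to 0x8000 in the low lane wraps to 0x8000 and 0x7FFF without saturation, but stays at 0x7FFF and 0x8000 with it.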





Subtract, 

optional saturation


SUBR[.S] rD, rS, rT 

32-bit subtraction. Considering both quantities as signed 32-bit integers,
subtract the contents of register rT from the contents of register rS. For SUBR, the
result is placed in register rD ignoring any overflow or underflow. For SUBR.S,
the result is saturated to 0 || 1^31 (if overflow) or 1 || 0^31 (if underflow) then
placed in rD. SUBR[.S] will not cause an Overflow Trap.


Dual Subtract, 
optional saturation 


SUBR2[.S] rD, rS, rT 

Dual 16-bit subtraction. Considering all quantities as signed 16-bit integers,
subtract the contents of register rT[15:00] from rS[15:00] and, independently
subtract the contents of register rT[31:16] from rS[31:16]. For SUBR2, the
high and low results are concatenated and placed in register rD ignoring any
overflow or underflow. For SUBR2.S, the two results are independently
saturated to 0 || 1^15 (if overflow) or 1 || 0^15 (if underflow) then placed in rD.
SUBR2[.S] will not cause an Overflow Trap.


Dual Set On Less 
Than 


SLTR2 rD, rS, rT 

Dual 16-bit comparison. Considering both quantities as signed 16-bit integers,
if rS[15:00] is less than rT[15:00] then set rD[15:00] to 0^15 || 1, else to zero.
Independently, considering both quantities as signed 16-bit integers, if
rS[31:16] is less than rT[31:16] then set rD[31:16] to 0^15 || 1, else to zero.
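
A short model of SLTR2, with a hypothetical function name, shows that each half of rD ends up as either 0x0001 or 0x0000:

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def sltr2(rs, rt):
    """Dual signed 16-bit set-on-less-than."""
    hi = 1 if to_signed(rs >> 16, 16) < to_signed(rt >> 16, 16) else 0
    lo = 1 if to_signed(rs, 16) < to_signed(rt, 16) else 0
    return (hi << 16) | lo
```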



Nomenclature: 

rD = r0-r31 

rS = r0-r31 

rT = r0-r31 



FIG- \2.B 
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Instruction 


Syntax and Description 


Minimum 


MIN rD, rS, rT 

The contents of the general register rT are compared with rS considering both 
quantities as signed 32-bit integers. If rS < rT or rS = rT, rS is placed into rD. If, 
rS > rT, rT is placed into rD. 


Dual Minimum


MIN2 rD, rS, rT

The contents of rT[31:16] are compared with rS[31:16] considering both
quantities as signed 16-bit integers. If rS[31:16] < rT[31:16] or rS[31:16] =
rT[31:16], rS[31:16] is placed into rD[31:16]. If rS[31:16] > rT[31:16],
rT[31:16] is placed into rD[31:16]. A similar, independent operation is
performed on rT[15:00] and rS[15:00] to determine rD[15:00].


Maximum 


MAX rD, rS, rT 

The contents of the general register rT are compared with rS considering both 
quantities as signed 32-bit integers. If rS > rT or rS = rT, rS is placed into rD. If, 
rS < rT, rT is placed into rD. 


Dual Maximum 


MAX2 rD, rS, rT 

The contents of rT[31:16] are compared with rS[31:16] considering both
quantities as signed 16-bit integers. If rS[31:16] > rT[31:16] or rS[31:16] =
rT[31:16], rS[31:16] is placed into rD[31:16]. If rS[31:16] < rT[31:16],
rT[31:16] is placed into rD[31:16]. A similar, independent operation is
performed on rT[15:00] and rS[15:00] to determine rD[15:00].


Absolute, 
optional saturation 


ABSR[.S] rD, rT

Considering rT as a signed 32-bit integer, if rT >= 0, rT is placed into rD. If rT <
0, -rT is placed into rD. If ABSR.S and rT = 1 || 0^31 (the smallest negative number)
then 0 || 1^31 (the largest positive number) is placed into rD; otherwise, if
ABSR and rT = 1 || 0^31, rT is placed into rD.
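
The only input for which ABSR and ABSR.S differ is the most negative 32-bit value, as this small sketch (hypothetical function name) illustrates:

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def absr(rt, saturate=False):
    """32-bit absolute value; returns the 32-bit result pattern."""
    value = to_signed(rt, 32)
    if value == -0x80000000:
        # -rT is not representable: saturate, or leave rT unchanged.
        return 0x7FFFFFFF if saturate else 0x80000000
    return abs(value)
```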


Dual Absolute, 
optional saturation 


ABSR2[.S] rD, rT

ABSR[.S] operations are performed independently on rT[31:16] and rT[15:00],
considering each to be 16-bit signed integers. rD is updated with the absolute
value of rT[31:16] concatenated with the absolute value of rT[15:00].


Dual Mux 


MUX2{[.HH], [.HL], [.LH], [.LL]} rD, rS, rT

rD[31:16] is updated with rS[31:16] for MUX2.HH or MUX2.HL.
rD[31:16] is updated with rS[15:00] for MUX2.LH or MUX2.LL.
rD[15:00] is updated with rT[31:16] for MUX2.HH or MUX2.LH.
rD[15:00] is updated with rT[15:00] for MUX2.HL or MUX2.LL.
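
The four MUX2 variants reduce to two independent halfword selections: the first suffix letter picks the half of rS that goes to rD[31:16], and the second picks the half of rT that goes to rD[15:00]. A minimal sketch, with a hypothetical function name and the suffix passed as a two-letter string:

```python
def mux2(rs, rt, sel):
    """sel is one of "HH", "HL", "LH", "LL"."""
    if sel not in ("HH", "HL", "LH", "LL"):
        raise ValueError("unknown MUX2 variant")
    hi = ((rs >> 16) & 0xFFFF) if sel[0] == "H" else (rs & 0xFFFF)
    lo = ((rt >> 16) & 0xFFFF) if sel[1] == "H" else (rt & 0xFFFF)
    return (hi << 16) | lo
```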


Count Leading Sign 
bits 


CLS rD, rT 

The binary-encoded number of redundant sign bits of general register rT is
placed into rD. If rT[31:30] = 10 or 01, rD is updated with 0. If rT = 0, or if rT
= 1^32, rD is updated with 0^27 || 1^5 (decimal 31).
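
Counting redundant sign bits amounts to counting how many bits below bit 31 repeat the sign bit before the first differing bit. A sketch with a hypothetical function name:

```python
def cls(rt):
    """Number of redundant sign bits in a 32-bit value (0 - 31)."""
    rt &= 0xFFFFFFFF
    sign = rt >> 31
    count = 0
    for bit in range(30, -1, -1):
        if ((rt >> bit) & 1) != sign:
            break
        count += 1
    return count
```

The result is the number of positions a value can be shifted left without changing its sign, which is how such a count is typically used for normalization.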



ALU Operations 



Instruction 




Bit Reverse


BITREV rD, rT, rS

A bit-reversal of the contents of general register rT is performed. The result is
then shifted right logically by the amount specified in the lower 5-bits of the
contents of general register rS, then stored in rD.
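
A model of BITREV, using a hypothetical function name, makes the role of the shift operand clearer: reversing all 32 bits and then shifting right by 32 - k yields the reversal of a k-bit index.

```python
def bitrev(rt, rs):
    """Reverse the 32 bits of rt, then shift right logically by rs[4:0]."""
    reversed_bits = int(f"{rt & 0xFFFFFFFF:032b}"[::-1], 2)
    return reversed_bits >> (rs & 0x1F)
```

For instance, with a 3-bit index, `bitrev(1, 29)` gives 4 and `bitrev(3, 29)` gives 6, the kind of index permutation used when reordering data for radix-2 transforms.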


Nomenclature: 




rD = r0-r31

rS = r0-r31

rT = r0-r31



Conditional Operations 






Instruction


Syntax and Description 


Conditional Move on 
Equal Zero 


CMVEQZ[.H][.L] rD, rS, rT

If the general register rT is equal to 0, the general register rD is updated with
rS; otherwise rD is unchanged. For [.H] if rT[31:16] is equal to 0, the full 32-bit
general register rD[31:00] is updated with rS; otherwise rD is unchanged. For
[.L] if rT[15:00] is equal to 0, the full 32-bit general register rD[31:00] is
updated with rS; otherwise rD is unchanged.


Conditional Move on 
Not Equal Zero 


CMVNEZ[.H][.L] rD, rS, rT

If the general register rT is not equal to 0, the general register rD is updated
with rS; otherwise rD is unchanged. For [.H] if rT[31:16] is not equal to 0, the
full 32-bit general register rD[31:00] is updated with rS; otherwise rD is
unchanged. For [.L] if rT[15:00] is not equal to 0, the full 32-bit general register
rD[31:00] is updated with rS; otherwise rD is unchanged.



Nomenclature: 



rD = r0-r31
rS = r0-r31
rT = r0-r31



Usage Note: 

When combined with the SLT or SLTR2 instructions, the conditional move instructions can be used to construct a 
complete set of conditional move macro-operations. For example: 



if (r3 < r4) r1 <- r2
CMVLT r1,r2,r3,r4   =>   SLT    AT,r3,r4
                         CMVNEZ r1,r2,AT

if (r3 >= r4) r1 <- r2
CMVGE r1,r2,r3,r4   =>   SLT    AT,r3,r4
                         CMVEQZ r1,r2,AT

if (r3 <= r4) r1 <- r2
CMVLE r1,r2,r3,r4   =>   SLT    AT,r4,r3
                         CMVEQZ r1,r2,AT

if (r3 > r4) r1 <- r2
CMVGT r1,r2,r3,r4   =>   SLT    AT,r4,r3
                         CMVNEZ r1,r2,AT
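
The four macro expansions can be checked with a small model of SLT and the two conditional moves. The Python function names mirror the mnemonics but are otherwise hypothetical; each function returns the new value of r1.

```python
def slt(a, b):
    """Set on less than: 1 if a < b (signed), else 0."""
    return 1 if a < b else 0

def cmvnez(rd, rs, rt):
    """rd is replaced by rs when rt is non-zero."""
    return rs if rt != 0 else rd

def cmveqz(rd, rs, rt):
    """rd is replaced by rs when rt is zero."""
    return rs if rt == 0 else rd

def cmvlt(r1, r2, r3, r4):
    return cmvnez(r1, r2, slt(r3, r4))   # SLT AT,r3,r4 ; CMVNEZ r1,r2,AT

def cmvge(r1, r2, r3, r4):
    return cmveqz(r1, r2, slt(r3, r4))   # SLT AT,r3,r4 ; CMVEQZ r1,r2,AT

def cmvle(r1, r2, r3, r4):
    return cmveqz(r1, r2, slt(r4, r3))   # SLT AT,r4,r3 ; CMVEQZ r1,r2,AT

def cmvgt(r1, r2, r3, r4):
    return cmvnez(r1, r2, slt(r4, r3))   # SLT AT,r4,r3 ; CMVNEZ r1,r2,AT
```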
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ADDMAt.S], 

SUBMAl.SJ, 

RNDA2 


CMULTA, 
MADDA2[.SJ. 
MSUBA21.SJ, 
MULTA2, MULNA2, 
MTA2 


DIVA(U) 


MULTA(U), 

MADDA(U),MSUBA(U) 
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MADDA(U), 
MSUBA(U) 
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CMULTA 


C/2 


(I9S) 
(19T) 


(19T) 
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(19T) 
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DIVA fin 


ro 
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1 
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MADDA2[.S], 
MSUBA2[.S], 
ADDMA[.S], 
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MTA2 
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I. Load/Store Formats 



Bits:   31-26     25-21   20-17     16-6        5-0
Field:  LEXOP     base    rt-even   immediate   SUBOP
        011111
Width:  6         5       4         11          6

Assembler Mnemonic   base   rt-even   immediate        Lexra SUBOP
LT                   base   rt-even   displacement/8   LT
ST                   base   rt-even   displacement/8   ST



Field:  LEXOP     pointer   rt   immediate   cc   SUBOP
        011111

Assembler Mnemonic   pointer   rt   immediate   cc   Lexra SUBOP
LBP[.Cn]             pointer   rt   stride      cc   LBP
LBPU[.Cn]            pointer   rt   stride      cc   LBPU
LHP[.Cn]             pointer   rt   stride/2    cc   LHP
LHPU[.Cn]            pointer   rt   stride/2    cc   LHPU
LWP[.Cn]             pointer   rt   stride/4    cc   LWP
LTP[.Cn]             pointer   rt   stride/8    cc   LTP
SBP[.Cn]             pointer   rt   stride      cc   SBP
SHP[.Cn]             pointer   rt   stride/2    cc   SHP
SWP[.Cn]             pointer   rt   stride/4    cc   SWP
STP[.Cn]             pointer   rt   stride/8    cc   STP



base, pointer, rt   Selects general register r0 - r31
rt-even             Selects general register even-odd pair r0/r1, r2/r3, ... r30/r31
stride              Signed 2s-complement number in bytes. Must be an integral number of
                    halfwords/words/twinwords for the corresponding instructions.
displacement        Signed 2s-complement number in bytes. Must be an integral number of twinwords.
cc                  00   select circular buffer 0 (cbs0, cbe0)
                    01   select circular buffer 1 (cbs1, cbe1)
                    10   select circular buffer 2 (cbs2, cbe2)
                    11   no circular buffer selected



fie*, n-k 



II. Arithmetic Format 



Bits:   31-26    25-21   20-16   15-11   10-9   8   7   6   5-0
Field:  LEXOP    rs      rt      rd      hl     0   s   d   SUBOP
        011111
Width:  6        5       5       5       2      1   1   1   6

Assembler Mnemonic     rs   rt   rd   hl   s   d   Lexra SUBOP
ADDR[.S], ADDR2[.S]    rs   rt   rd   0    s   d   ADDR
SUBR[.S], SUBR2[.S]    rs   rt   rd   0    s   d   SUBR
SLTR2                  rs   rt   rd   0    0   1   SLTR
SLLV2                  rs   rt   rd   0    0   1   SLLV
SRLV2                  rs   rt   rd   0    0   1   SRLV
SRAV2                  rs   rt   rd   0    0   1   SRAV
MIN, MIN2              rs   rt   rd   0    0   d   MIN
MAX, MAX2              rs   rt   rd   0    0   d   MAX
ABSR[.S], ABSR2[.S]    0    rt   rd   0    s   d   ABSR
MUX2.[LL,LH,HL,HH]     rs   rt   rd   hl   0   1   MUX
CLS                    0    rt   rd   0    0   0   CLS
BITREV                 rs   rt   rd   0    0   0   BITREV



rs, rt, rd      Selects general register r0 - r31.

s               Selects saturation of result. s=1 indicates that saturation is performed.

d               d=1 indicates that dual operations on 16-bit data are performed.

hl (for MUX2)   00   LL: rD = rs[15:00] || rt[15:00]
                01   LH: rD = rs[15:00] || rt[31:16]
                10   HL: rD = rs[31:16] || rt[15:00]
                11   HH: rD = rs[31:16] || rt[31:16]
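
Packing an Arithmetic Format instruction word from its fields can be sketched as below. The field positions follow the layout in this section (LEXOP 31-26, rs 25-21, rt 20-16, rd 15-11, hl 10-9, s at bit 7, d at bit 6, SUBOP 5-0). The numeric SUBOP codes are not listed here, so `subop` is left as a caller-supplied value, and the function name is hypothetical.

```python
LEXOP = 0b011111

def encode_arith(rs, rt, rd, subop, hl=0, s=0, d=0):
    """Assemble a 32-bit Arithmetic Format word; bit 8 is always 0."""
    for name, value, width in (("rs", rs, 5), ("rt", rt, 5), ("rd", rd, 5),
                               ("hl", hl, 2), ("s", s, 1), ("d", d, 1),
                               ("subop", subop, 6)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
    return ((LEXOP << 26) | (rs << 21) | (rt << 16) | (rd << 11)
            | (hl << 9) | (s << 7) | (d << 6) | subop)
```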



III. MAC Format A



Bits:   31-26    25-21   20-16   15-11   10   9   8    7   6   5-0
Field:  LEXOP    rs      rt      md      0    u   gz   s   d   SUBOP
        011111

Assembler Mnemonic   rs   rt   md   u   gz   s   d   Lexra SUBOP
CMULTA               rs   rt   md   0   0    0   0   CMULTA
DIVA(U)              rs   rt   md   u   0    0   0   DIVA
MULTA(U)             rs   rt   md   u   1    0   0   MADDA
MULTA2               rs   rt   md   0   1    0   1   MADDA
MADDA(U)             rs   rt   md   u   0    0   0   MADDA
MADDA2[.S]           rs   rt   md   0   0    s   1   MADDA
MSUBA(U)             rs   rt   md   u   0    0   0   MSUBA
MSUBA2[.S]           rs   rt   md   0   0    s   1   MSUBA
MULNA2               rs   rt   md   0   1    0   1   MSUBA
MTA2[.G]             rs   0    md   0   g    0   1   MTA



rs, rt   Selects general register r0 - r31.

md       Selects accumulator, 0NNHL where,
         NN = m0 - m3
         HL: 00 = reserved   01 = mNl
             10 = mNh        11 = mN

s        Selects saturation of result. s=1 indicates that saturation is performed.

d        d=1 indicates that dual operations on 16-bit data are performed.

gz       For MTA2, used as "guard" bit. If g=1, bits [39:32] of the accumulator (pair) are
         loaded and bits [31:00] are unchanged. If g=0, all 40 bits [39:00] of the accumulator (or pair) are
         updated.

         For MADDA, MSUBA, used as a "zero" bit. If z = 1, the result is added to
         (subtracted from) zero rather than the previous accumulator value; this performs a MULTA,
         MULTA2 or MULNA2. If z = 0, performs a MADDA, MSUBA, MADDA2 or MSUBA2.

u        Treat operands as unsigned values (0 = signed, 1 = unsigned)
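
Decoding the 5-bit accumulator selector 0NNHL is a matter of splitting it into its NN and HL parts. The sketch below uses a hypothetical function name and returns the accumulator name as text:

```python
def decode_accumulator(md):
    """Map a 5-bit 0NNHL selector to a name such as "m1h", "m0l" or "m3"."""
    if not 0 <= md < 32 or md & 0x10:
        raise ValueError("selector must be a 5-bit value with bit 4 clear")
    n = (md >> 2) & 0x3
    hl = md & 0x3
    if hl == 0:
        raise ValueError("HL = 00 is reserved")
    return f"m{n}" + {1: "l", 2: "h", 3: ""}[hl]
```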



IV. MAC Format B 



Bits:   31-26    25-21   20-16   15-11   10-7   6   5-0
Field:  LEXOP    00000   mt      rd      so     d   SUBOP
        011111
Width:  6        5       5       5       4      1   6

Assembler Mnemonic   mt   rd   so   d   Lexra SUBOP
MFA, MFA2            mt   rd   so   d   MFA
RNDA2                mt   0    so   1   RNDA



rd   Selects general register r0 - r31.

mt   Selects accumulator, 0NNHL where,
     NN = m0 - m3
     HL: 00 = reserved
         01 = mNl
         10 = mNh
         11 = mN

d    d=1 indicates that dual operations on 16-bit data are performed.

so   Encoded ("output") shift amount n = 0 - 8 for RNDA2, MFA, MFA2 instructions.



V. MAC Format C 



Bits:   31-26    25-21   20-16   15-11   10-8   7   6   5-0
Field:  LEXOP    ms      mt      md      000            SUBOP
        011111
Width:  6        5       5       5       3      1   1   6

Assembler Mnemonic   ms   mt   md   s   Lexra SUBOP
ADDMA[.S]            ms   mt   md   s   ADDMA
SUBMA[.S]            ms   mt   md   s   SUBMA



mt, ms, md   Selects accumulator, 0NNHL where,
             NN = m0 - m3
             HL: 00 = reserved
                 01 = mNl
                 10 = mNh
                 11 = reserved

s            Selects saturation of result. s=1 indicates that saturation is performed.



VI. RADIAX MOVE Format and Lexra-Cop0 MTLXC0/MFLXC0 Instructions



Bits:   31-26    25-21   20-16   15-11   10-8   7   6   5-0
Field:  LEXOP    00000   rt      ru/rk   000    k   0   SUBOP
        011111

Assembler Mnemonic   rt   ru/rk   k   Lexra SUBOP
MFRU                 rt   ru      0   MFRAD
MTRU                 rt   ru      0   MTRAD
MFRK                 rt   rk      1   MFRAD
MTRK                 rt   rk      1   MTRAD



rt   Selects general register r0 - r31.

rk   Selects Radiax Kernel register in MFRK, MTRK instructions (currently all reserved). However,
     a Coprocessor Unusable Exception is taken in User mode if the Cu0 bit is 0 in the CP0 Status
     register when MFRK or MTRK is executed.

ru   Selects Radiax User register in MFRU, MTRU instructions.

     00000   cbs0
     00001   cbs1
     00010   cbs2
     00011   reserved
     00100   cbe0
     00101   cbe1
     00110   cbe2
     00111   reserved
     01xxx   reserved
     10000   lps0
     10001   lpe0
     10010   lpc0
     10011   reserved
     101xx   reserved
     11000   mmd
     11001   reserved
     111xx   reserved



F\c*.n F 



Lexra-Coprocessor0 Register Access Instructions

Bits:   31-26    25-21   20-16   15-11   10-0
Field:  COP0     MFLX    rt      rd      00000000000
        010000   00011
Width:  6        5       5       5       11

Bits:   31-26    25-21   20-16   15-11   10-0
Field:  COP0     MTLX    rt      rd      00000000000
        010000   00111
Width:  6        5       5       5       11

Assembler Mnemonic   Copz rs   rt   rd
MFLXC0               MFLX      rt   rd
MTLXC0               MTLX      rt   rd



These are not LEXOP instructions. They are variants of the standard MTC0 and MFC0
instructions that allow access to the Lexra Coprocessor0 Registers listed below. As with any
COP0 instruction, a Coprocessor Unusable Exception is taken in User mode if the Cu0 bit is 0 in
the CP0 Status register when these instructions are executed.



rt   Selects general register r0 - r31.

rd   Selects Lexra Coprocessor0 register:

     00000   ESTATUS
     00001   ECAUSE
     00010   INTVEC
     00011   reserved
     001xx   reserved
     01xxx   reserved
     1xxxx   reserved



VII. CMOVE Format



Bits:   31-26    25-21   20-16   15-11   10-9   8-6    5-0
Field:  LEXOP    rs      rt      rd      00     cond   SUBOP
        011111

Assembler Mnemonic   rs   rt   rd   cond   Lexra SUBOP
CMVEQZ[.H][.L]       rs   rt   rd   cond   CMOVE
CMVNEZ[.H][.L]       rs   rt   rd   cond   CMOVE

rs, rt, rd   Selects general register r0 - r31.

cond         Condition code for rT operand referenced by the conditional move.

             000   EQZ
             001   NEZ
             010   EQZ.H
             011   NEZ.H
             100   EQZ.L
             101   NEZ.L
             11x   reserved
CMOVE 



Fid. PFU 



